Dreamer: Model-Based RL with Latent Dynamics#

Dreamer is a model-based reinforcement learning algorithm that learns a latent dynamics model from images and trains a behavior policy entirely in the latent space.

Based on the Dreamer and DreamerV2 papers listed in the References section below.

Key Idea#

Dreamer learns:

  1. World Model: Latent dynamics model that predicts future latent states

  2. Value Model: Estimates expected returns from any latent state

  3. Policy: Actions that maximize expected returns in latent space

The key innovation is learning behaviors purely in imagination: the actor and critic are trained entirely on latent rollouts of the world model rather than on real environment transitions.
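
Training alternates between world-model learning, behavior learning in imagination, and data collection. A minimal sketch of one iteration; every method name here is hypothetical, not the repo's actual API:

def train_iteration(world_model, actor, critic, replay_buffer, env, horizon=15):
    # Illustrative only: all method names below are assumed, not the repo's API.
    batch = replay_buffer.sample()                    # sequences of (image, action, reward)
    post = world_model.update(batch)                  # reconstruction + reward + KL losses
    traj = world_model.imagine(post, actor, horizon)  # latent rollout; env is never touched
    returns = critic.lambda_returns(traj)             # targets G_t (see Learning Objectives)
    actor.update(traj, returns)                       # maximize imagined returns
    critic.update(traj, returns)                      # regress onto the lambda-returns
    replay_buffer.add(env.collect(actor))             # gather fresh real experience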

Architecture#

┌────────────────────────────────────────────────────────────────────────┐
│                           World Model (RSSM)                           │
│                                                                        │
│  ┌─────────┐    ┌─────────────────────┐    ┌────────────────────────┐  │
│  │ Encoder │    │    Latent Model     │    │        Decoder         │  │
│  │  (CNN)  │ -> │    (GRU + Stoch)    │ -> │    (Transposed CNN)    │  │
│  │  64x64  │    │ h_t = f(h_{t-1},    │    │                        │  │
│  │         │    │   s_{t-1}, a_{t-1}) │    │   p(x_t | s_t, h_t)    │  │
│  └─────────┘    └─────────────────────┘    └────────────────────────┘  │
│                                                                        │
│    Prior: s_t ~ p(s_t | h_t)     Posterior: s_t ~ q(s_t | h_t, x_t)    │
└────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────────┐
│                          Imagination Rollout                           │
│                                                                        │
│  s_0 ──► a_0 ──► s_1 ──► a_1 ──► s_2 ──► ... ──► s_H                   │
│   │       │                                                            │
│   └───────┴────────────────────┐                                       │
│                                ▼                                       │
│          ┌──────────────────────────────────────────────┐              │
│          │               λ-return target                │              │
│          │ G_t = r_t + γ[(1-λ) v(s_{t+1}) + λ G_{t+1}]  │              │
│          └──────────────────────────────────────────────┘              │
└────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────────┐
│                         Actor-Critic Learning                          │
│                                                                        │
│  Actor:  π(a_t | s_t, h_t)  ──►  REINFORCE with baseline               │
│  Critic: v(s_t, h_t)        ──►  MSE on λ-returns                      │
└────────────────────────────────────────────────────────────────────────┘

Components#

1. Recurrent State Space Model (RSSM)#

The core world model combines two states:

  • Deterministic hidden state (h_t): Recurrent state (GRU)

  • Stochastic latent state (s_t): Discrete or continuous latent variables

Dynamics:  h_t = f(h_{t-1}, s_{t-1}, a_{t-1})
Posterior: s_t ~ q(s_t | h_t, x_t)
Prior:     s_t ~ p(s_t | h_t)
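
As a concrete illustration, one RSSM step could be implemented as below. This is a sketch with Gaussian latents and assumed layer sizes, not the repo's exact module:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributions as td

class RSSMCell(nn.Module):
    def __init__(self, stoch=30, deter=200, embed=1024, act_dim=1, hidden=200):
        super().__init__()
        self.inp = nn.Sequential(nn.Linear(stoch + act_dim, hidden), nn.ELU())
        self.gru = nn.GRUCell(hidden, deter)                 # deterministic path
        self.prior_net = nn.Linear(deter, 2 * stoch)         # params of p(s_t | h_t)
        self.post_net = nn.Linear(deter + embed, 2 * stoch)  # params of q(s_t | h_t, x_t)

    def _dist(self, params):
        mean, std = params.chunk(2, -1)
        return td.Normal(mean, F.softplus(std) + 0.1)        # keep the scale positive

    def forward(self, h, s, a, embed=None):
        h = self.gru(self.inp(torch.cat([s, a], -1)), h)     # h_t = f(h_{t-1}, s_{t-1}, a_{t-1})
        prior = self._dist(self.prior_net(h))                # p(s_t | h_t), used in imagination
        if embed is None:
            return h, prior.rsample(), prior, None
        post = self._dist(self.post_net(torch.cat([h, embed], -1)))  # q(s_t | h_t, x_t)
        return h, post.rsample(), prior, post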

2. Encoder/Decoder#

  • Encoder: CNN that maps images to latent embeddings

  • Decoder: Transposed CNN that reconstructs images from latents

  • Both use ReLU activations and residual connections
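
For reference, the classic Dreamer 64x64 encoder can be written as a four-layer CNN whose flattened output matches the embed_size default of 1024. The sketch below assumes these layer sizes and omits the residual connections:

import torch.nn as nn

encoder = nn.Sequential(                          # input: 3 x 64 x 64 image
    nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),     # -> 32 x 31 x 31
    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),    # -> 64 x 14 x 14
    nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),   # -> 128 x 6 x 6
    nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),  # -> 256 x 2 x 2
    nn.Flatten(),                                 # -> 1024 = embed_size
)
# The decoder mirrors this with nn.ConvTranspose2d layers back to 3 x 64 x 64.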

3. Reward/Discount Heads#

  • Reward model: Predicts reward from latent state

  • Discount model: Predicts episode termination (DreamerV2)
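
Both heads are typically small MLPs on the concatenated latent [h_t, s_t]; a sketch assuming the default sizes from the hyperparameter table below:

import torch.nn as nn

feat = 200 + 30                          # deter_size + stoch_size
reward_head = nn.Sequential(             # predicts the mean of r_t from [h_t, s_t]
    nn.Linear(feat, 300), nn.ELU(),
    nn.Linear(300, 300), nn.ELU(),
    nn.Linear(300, 1),
)
discount_head = nn.Sequential(           # DreamerV2: logit of P(episode continues)
    nn.Linear(feat, 300), nn.ELU(),
    nn.Linear(300, 1),
)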

Training#

from world_models.models import DreamerAgent
from world_models.configs import DreamerConfig

cfg = DreamerConfig()
cfg.env_backend = "gym"       # train on a Gym/Gymnasium environment
cfg.env = "Pendulum-v1"
cfg.total_steps = 1_000_000   # total environment steps to collect

agent = DreamerAgent(cfg)
agent.train()

Key Hyperparameters#

| Parameter       | Default | Description                  |
|-----------------|---------|------------------------------|
| stoch_size      | 30      | Stochastic latent dimensions |
| deter_size      | 200     | Deterministic hidden size    |
| embed_size      | 1024    | Encoder embedding size       |
| imagine_horizon | 15      | Imagination rollout length   |
| discount        | 0.99    | Discount factor γ            |
| td_lambda       | 0.95    | λ-return parameter           |
| kl_loss_coeff   | 1.0     | KL divergence weight         |
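
Assuming these names map one-to-one onto DreamerConfig attributes (an assumption based on the table, not a verified API), they can be overridden before training:

cfg = DreamerConfig()
cfg.stoch_size = 32       # wider stochastic latent
cfg.imagine_horizon = 20  # longer imagined rollouts
cfg.kl_loss_coeff = 0.5   # weaker latent regularization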

Learning Objectives#

World Model Loss:

L_world = L_reconstruction + L_reward + β * L_KL
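
Given distribution outputs from the decoder, reward head, and the two latent distributions, the loss can be assembled as below (a sketch; β corresponds to kl_loss_coeff):

import torch.distributions as td

def world_model_loss(decoder_dist, reward_dist, post, prior, images, rewards, beta=1.0):
    recon = -decoder_dist.log_prob(images).mean()   # L_reconstruction
    rew = -reward_dist.log_prob(rewards).mean()     # L_reward
    kl = td.kl_divergence(post, prior).mean()       # L_KL
    return recon + rew + beta * kl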

Actor Loss (REINFORCE):

L_actor = -E[log π(a|s) * (G - V(s))]

Critic Loss (MSE):

L_critic = E[(G - V(s))²]
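
The λ-return recursion bottoms out at the bootstrap value G_H = v(s_H) and is computed backward over the imagined horizon. A sketch:

import torch

def lambda_returns(rewards, values, discount=0.99, lam=0.95):
    # rewards: [H, batch]; values: [H + 1, batch] (one extra bootstrap step v(s_H))
    returns = values[-1]                            # G_H = v(s_H)
    out = []
    for t in reversed(range(rewards.shape[0])):
        returns = rewards[t] + discount * ((1 - lam) * values[t + 1] + lam * returns)
        out.append(returns)
    return torch.stack(out[::-1])                   # [H, batch], ordered G_0 .. G_{H-1}

# With returns G and values V = values[:-1] over the trajectory:
#   actor_loss  = -(log_probs * (G - V).detach()).mean()   # REINFORCE with baseline
#   critic_loss = ((V - G.detach()) ** 2).mean()           # MSE on λ-returns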

DreamerV2 Enhancements#

DreamerV2 introduces several improvements:

  1. Discrete latents: Categorical latent variables instead of Gaussian

  2. KL balancing: Separate weighting of the prior and posterior KL terms (see the sketch after this list)

  3. Discount model: Learns to predict episode termination

  4. Layer normalization: More stable training
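
KL balancing mixes two stop-gradient variants of the same KL term so the prior is trained toward the posterior faster than the posterior is regularized. A sketch assuming Gaussian latents for brevity (DreamerV2 itself uses categoricals):

import torch.distributions as td

def kl_balance(post, prior, alpha=0.8):
    sg = lambda d: td.Normal(d.mean.detach(), d.stddev.detach())  # stop-gradient copy
    lhs = td.kl_divergence(sg(post), prior).mean()  # trains the prior toward the posterior
    rhs = td.kl_divergence(post, sg(prior)).mean()  # regularizes the posterior
    return alpha * lhs + (1 - alpha) * rhs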

Environment Support#

Dreamer supports multiple backends:

cfg = DreamerConfig()   # pick one backend/environment pair below
cfg.env_backend = "dmc"      # DeepMind Control Suite
cfg.env = "walker-walk"

cfg.env_backend = "gym"      # Gym/Gymnasium
cfg.env = "Pendulum-v1"

cfg.env_backend = "unity_mlagents"  # Unity ML-Agents
cfg.unity_file_name = "env.exe"

References#

  • Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2019). Dream to Control: Learning Behaviors by Latent Imagination.

  • Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Mastering Atari with Discrete World Models.