# Dreamer: Model-Based RL with Latent Dynamics

Dreamer is a model-based reinforcement learning algorithm that learns a latent dynamics model from images and trains a behavior policy entirely in the latent space.

Based on papers:

- [Dream to Control: Learning Behaviors by Latent Imagination](https://arxiv.org/abs/1912.01603) (DreamerV1, Hafner et al., 2019)
- [Mastering Atari with Discrete World Models](https://arxiv.org/abs/2010.02193) (DreamerV2, Hafner et al., 2020)

```{contents} Contents
:depth: 3
```

## Overview

Dreamer learns a world model from image observations, then trains an actor-critic policy entirely in the imagination of that world model. No gradients flow from the environment to the policy: the world model is the only bridge between real experience and learned behavior.

The family has two major versions, documented individually below.

---

## DreamerV1

### Theory

#### Recurrent State-Space Model (RSSM) with Gaussian Latents

DreamerV1's RSSM maintains a hybrid state with two components:

**1. Deterministic state** `h_t`: a GRU hidden state that captures temporal dependencies and deterministic transitions:

```{math}
h_t = \text{GRU}(h_{t-1}, [s_{t-1}, a_{t-1}])
```

**2. Stochastic state** `s_t`: a diagonal Gaussian latent variable with `stoch_size` means and variances, representing uncertainty.

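To make the two state components concrete, here is a minimal PyTorch sketch of a single RSSM step: a GRU update of `h_t` followed by a reparameterized sample of `s_t` from a Gaussian head conditioned on `h_t`. The class name `TinyGaussianRSSM`, the `action_size` value, the single linear head, and the `softplus(...) + 0.1` standard-deviation floor are illustrative assumptions, not the torchwm implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyGaussianRSSM(nn.Module):
    """Illustrative RSSM step (hypothetical, not the library class)."""

    def __init__(self, stoch_size=30, deter_size=200, action_size=6):
        super().__init__()
        # The GRU input is the concatenation [s_{t-1}, a_{t-1}].
        self.gru = nn.GRUCell(stoch_size + action_size, deter_size)
        # Maps h_t to a mean and a pre-softplus standard deviation.
        self.head = nn.Linear(deter_size, 2 * stoch_size)

    def step(self, h_prev, s_prev, a_prev):
        # Deterministic update: h_t = GRU(h_{t-1}, [s_{t-1}, a_{t-1}]).
        h = self.gru(torch.cat([s_prev, a_prev], dim=-1), h_prev)
        mean, raw_std = self.head(h).chunk(2, dim=-1)
        std = F.softplus(raw_std) + 0.1  # assumed floor keeps std positive
        # Reparameterized sample from the diagonal Gaussian.
        s = mean + std * torch.randn_like(std)
        return h, s


rssm = TinyGaussianRSSM()
h0 = torch.zeros(4, 200)  # batch of 4 deterministic states
s0 = torch.zeros(4, 30)   # batch of 4 stochastic states
a0 = torch.zeros(4, 6)    # batch of 4 actions
h1, s1 = rssm.step(h0, s0, a0)
```

Calling `step` repeatedly with actions from a policy yields a latent rollout without touching image observations.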
The model operates in two modes:

**Observe mode** (training, uses real observations):

```{math}
\text{Posterior: } s_t \sim q(s_t | h_t, \text{enc}(x_t))
```

**Imagine mode** (policy training, no observations):

```{math}
\text{Prior: } s_t \sim p(s_t | h_t)
```

#### World Model Loss

The complete world model objective (V1):

```{math}
\mathcal{L}_{\text{WM}} = \mathcal{L}_{\text{pred}} + \beta \cdot \mathcal{L}_{\text{KL}}
```

Where `e_t = enc(x_t)` is the observation embedding and:

```{math}
\begin{aligned}
\mathcal{L}_{\text{pred}} &= \underbrace{\|x_t - \hat{x}_t\|^2}_{\text{image reconstruction}} + \underbrace{\|r_t - \hat{r}_t\|^2}_{\text{reward prediction}} \\
\mathcal{L}_{\text{KL}} &= D_{\text{KL}}\big(q(s_t | h_t, e_t) \;\|\; p(s_t | h_t)\big)
\end{aligned}
```

V1 applies a single KL coefficient `β` without balancing.

#### Actor-Critic in Imagination

DreamerV1 rolls out imagined trajectories using the prior dynamics and trains the actor and critic purely in latent space:

**Actor loss** (REINFORCE with baseline, plus an entropy bonus weighted by `η`):

```{math}
\mathcal{L}_{\text{actor}} = -\sum_{t=1}^{T} \Big[ \log \pi(a_t | s_t) \cdot \text{sg}(G_t^\lambda - V(s_t)) + \eta \cdot H[\pi(\cdot | s_t)] \Big]
```

**Critic loss** (the λ-return is treated as a fixed target):

```{math}
\mathcal{L}_{\text{critic}} = \sum_{t=1}^{T} \|V(s_t) - \text{sg}(G_t^\lambda)\|^2
```

**Lambda return** (with fixed `γ`):

```{math}
G_t^\lambda = r_t + \gamma \cdot
\begin{cases}
(1 - \lambda) V(s_{t+1}) + \lambda G_{t+1}^\lambda & \text{if } t < T \\
V(s_T) & \text{if } t = T
\end{cases}
```

### Examples

```python
import torchwm

agent = torchwm.create_model(
    "dreamer",
    env_backend="dmc",
    env="walker-walk",
    total_steps=5_000_000,
)
agent.train()
```

Explicit V1 config:

```python
from torchwm import DreamerAgent, DreamerConfig

cfg = DreamerConfig()

# Select DreamerV1
cfg.algo = "Dreamerv1"

# Gaussian latent (V1 default)
cfg.stoch_size = 30  # diagonal Gaussian dimensions
cfg.deter_size = 200

# Environment
cfg.env_backend = "dmc"
cfg.env = "walker-walk"
cfg.total_steps = 5_000_000

# KL (single coefficient)
cfg.kl_loss_coeff = 1.0

agent = DreamerAgent(cfg)
agent.train()
```

```bash
torchwm train dreamer --env dmc/walker-walk --algo Dreamerv1 --device cuda
```

---

## DreamerV2

### Theory

#### Recurrent State-Space Model (RSSM) with Categorical Latents

DreamerV2's RSSM maintains the same hybrid state structure as V1 but replaces Gaussian latents with **discrete categorical latents**:

```{math}
h_t = \text{GRU}(h_{t-1}, [s_{t-1}, a_{t-1}])
```

**Stochastic state** `s_t`: a concatenation of `num_categories` one-hot categorical samples, each over `classes` classes:

```python
# V2: stack of categoricals (e.g., 32 categoricals × 32 classes)
self.stoch = torch.cat([one_hot(logits[i]) for i in range(num_categories)], dim=-1)
```

Default: 32 categories × 32 classes = 1024 total latent dimensions.

Discrete latents can represent multimodal posteriors that a single Gaussian cannot, which helps with aleatoric uncertainty in complex environments like Atari.

#### World Model Loss with KL Balancing

V2 introduces **KL balancing**: the KL term is split into two copies that differ only in where the stop-gradient (`sg`) is placed, and the copies are weighted separately:

```{math}
\mathcal{L}_{\text{KL}} = \alpha \cdot D_{\text{KL}}[\text{sg}(q) \| p] + (1 - \alpha) \cdot D_{\text{KL}}[q \| \text{sg}(p)]
```

where `α` (default 0.8) follows the DreamerV2 paper and puts more weight on the first term, which trains the prior toward the posterior, than on the second, which regularizes the posterior toward the prior. This keeps the posterior from being pulled toward a poorly trained prior.

**Free nats**: A threshold (default 3 nats) below which KL is not penalized.

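KL balancing with free nats can be sketched in a few lines of PyTorch. This is an illustration under stated assumptions rather than the library's code: the function name `balanced_kl` is hypothetical, `α` scales the term with the stop-gradient on the posterior (the convention in the DreamerV2 paper), and free nats are applied to the batch mean of each term, which is one of several reasonable choices.

```python
import torch
from torch.distributions import Independent, OneHotCategorical, kl_divergence


def balanced_kl(post_logits, prior_logits, alpha=0.8, free_nats=3.0):
    """KL balancing for logits of shape (batch, num_categories, classes)."""

    def dist(logits):
        # Treat the num_categories categoricals as one joint distribution,
        # so the KL is summed over categories.
        return Independent(OneHotCategorical(logits=logits), 1)

    # Trains the prior: the posterior is detached (stop-gradient).
    kl_prior = kl_divergence(dist(post_logits.detach()), dist(prior_logits))
    # Regularizes the posterior: the prior is detached (stop-gradient).
    kl_post = kl_divergence(dist(post_logits), dist(prior_logits.detach()))

    # Free nats (assumed placement): no penalty below the threshold.
    kl_prior = torch.clamp(kl_prior.mean(), min=free_nats)
    kl_post = torch.clamp(kl_post.mean(), min=free_nats)
    return alpha * kl_prior + (1.0 - alpha) * kl_post


torch.manual_seed(0)
post = torch.randn(4, 32, 32, requires_grad=True)
prior = torch.randn(4, 32, 32, requires_grad=True)
loss = balanced_kl(post, prior, free_nats=0.0)
loss.backward()  # both posterior and prior logits receive gradients
```

Both terms have the same value in the forward pass; they differ only in which set of logits receives gradient, so `α` controls how quickly the prior and the posterior move toward each other.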
The full world model objective (V2):

```{math}
\mathcal{L}_{\text{WM}} = \mathcal{L}_{\text{pred}} + \mathcal{L}_{\text{KL}}
```

```{math}
\mathcal{L}_{\text{pred}} = \|x_t - \hat{x}_t\|^2 + \|r_t - \hat{r}_t\|^2 + \text{BCE}(\gamma_t, \hat{\gamma}_t)
```

#### Discount Head

V2 adds a learned **discount (termination) head** that predicts the episode continuation probability `γ̂_t` and is trained with binary cross-entropy:

```{math}
\mathcal{L}_{\text{disc}} = \text{BCE}(\gamma_t, \hat{\gamma}_t)
```

This matters for Atari, where episodes can end on life loss: imagined rollouts have no environment to signal termination, so the discount must be predicted rather than fixed.

#### Architecture Improvements

- **Layer normalization** in GRU and MLP layers for training stability
- **SiLU activations** replace ELU throughout
- **Two-hot reward encoding** replaces MSE: discretizes reward into 255 bins and predicts a softmax distribution over bins

#### Actor-Critic in Imagination

Same structure as V1 but uses the learned discount `γ̂_t` in λ-returns:

```{math}
G_t^\lambda = r_t + \hat{\gamma}_t \cdot
\begin{cases}
(1 - \lambda) V(s_{t+1}) + \lambda G_{t+1}^\lambda & \text{if } t < T \\
V(s_T) & \text{if } t = T
\end{cases}
```

### Examples

```python
import torchwm

agent = torchwm.create_model(
    "dreamer",
    env_backend="atari",
    env="PongNoFrameskip-v4",
    algo="Dreamerv2",
    total_steps=10_000_000,
)
agent.train()
```

Explicit V2 config:

```python
from torchwm import DreamerAgent, DreamerConfig

cfg = DreamerConfig()

# Select DreamerV2
cfg.algo = "Dreamerv2"

# Categorical latent (V2)
cfg.stoch_size = 32      # number of classes per categorical
cfg.num_categories = 32  # number of categorical distributions
cfg.deter_size = 200

# Environment
cfg.env_backend = "atari"
cfg.env = "PongNoFrameskip-v4"
cfg.total_steps = 10_000_000

# KL balancing
cfg.kl_alpha = 0.8
cfg.free_nats = 3.0

# Discount (V2 uses learned termination)
cfg.discount = 0.997

agent = DreamerAgent(cfg)
agent.train()
```

```bash
torchwm train dreamer --env atari/PongNoFrameskip-v4 --algo Dreamerv2 --device cuda
```

---

## Differences Between DreamerV1 and DreamerV2

| Aspect | DreamerV1 | DreamerV2 |
|--------|-----------|-----------|
| **Latent type** | Gaussian (continuous) | One-hot categorical (discrete) |
| **Stochastic state** | `stoch_size` diagonal Gaussian | `num_categories` × `classes` categoricals |
| **KL formulation** | Single coefficient `β` | KL balancing with `α` weight + free nats |
| **Discount** | Fixed `γ` throughout episode | Learned termination predictor `γ̂_t` |
| **Reward loss** | MSE | Two-hot discretized cross-entropy |
| **Activations** | ELU | SiLU |
| **Normalization** | None in GRU/MLP | LayerNorm in GRU and MLP layers |
| **Atari performance** | ~40% human-normalized score | ~100% human-normalized score |
| **Key advantage** | Simpler, fewer hyperparameters | Better on complex/discrete environments |

### Categorical vs Gaussian Latents

V1 uses a diagonal Gaussian for the stochastic state. V2 uses a concatenation of one-hot categorical distributions:

```python
# V1: single Gaussian
self.stoch = torch.distributions.Normal(mean, std)

# V2: stack of categoricals
self.stoch = torch.cat([one_hot(logits[i]) for i in range(num_categories)], dim=-1)
```

Discrete latents better capture multimodal posteriors (e.g., "the robot could be at door A or door B") and are less prone to posterior collapse.

### KL Balancing

| Formulation | V1 | V2 |
|-------------|----|----|
| KL loss | `β · KL[q ‖ p]` | `α · KL[sg(q) ‖ p] + (1-α) · KL[q ‖ sg(p)]` |
| Stop-gradient | None | On posterior in first term, prior in second |
| Effect | Single trade-off | Prior learns to follow the posterior faster than the posterior is pulled toward the prior |

### Discount Head

| | V1 | V2 |
|---|----|----|
| Discount | Fixed scalar `γ=0.99` | Learned `γ̂_t` from BCE loss |
| Purpose | Simple time discount | Model episode termination (life loss in Atari) |

---

## Shared Architecture

### High-level diagram

The pipeline has three stages:

- **World Model (RSSM):** encoder CNN (64x64 input), GRU plus stochastic latent model, transposed-CNN decoder
- **Imagination Rollout:** state `s0`, action `a0`, imagined future states, lambda-return target
- **Actor-Critic Learning:** actor (policy) and critic (value model)

### Detailed architecture (RSSM)

```{mermaid}
graph TD
    A["Image x_t"] --> B["ConvEncoder"]
    B --> C["Obs embed e_t"]
    D["Prev state h_{t-1}, s_{t-1}"] --> E["GRU (deter)"]
    F["Prev action a_{t-1}"] --> E
    E --> G["h_t (deterministic)"]
    G --> H["Prior model p(s_t | h_t)"]
    H --> I["s_t (stochastic prior)"]
    C --> J["Posterior model q(s_t | h_t, e_t)"]
    G --> J
    J --> K["s_t (stochastic posterior)"]
    K --> L["ConvDecoder → x̂_t"]
    K --> M["Reward head → r̂_t"]
    K --> N["Discount head → γ̂_t (V2 only)"]
```

### Recurrent State-Space Model (RSSM)

The core of Dreamer is the RSSM, defined in `world_models.models.dreamer_rssm.RSSM`. It maintains a hybrid state with two components:

**1. Deterministic state** `h_t`: a GRU hidden state that captures temporal dependencies and deterministic transitions:

```{math}
h_t = \text{GRU}(h_{t-1}, [s_{t-1}, a_{t-1}])
```

**2. Stochastic state** `s_t`: a latent variable representing uncertainty.

- **V1:** Diagonal Gaussian with `stoch_size` means and variances.
- **V2:** Concatenation of `num_categories` one-hot categoricals, each over `classes` classes. Default: 32 categories × 32 classes = 1024 total.

The model operates in two modes:

**Observe mode** (training, uses real observations):

```{math}
\text{Posterior: } s_t \sim q(s_t | h_t, \text{enc}(x_t))
```

**Imagine mode** (policy training, no observations):

```{math}
\text{Prior: } s_t \sim p(s_t | h_t)
```

Key insight: the prior learns to predict the posterior without seeing the observation. During imagination, the prior serves as the dynamics model.

### CNN Encoder (`world_models.vision.dreamer_encoder.ConvEncoder`)

Four-layer CNN with increasing channels (32 → 64 → 128 → 256) and ReLU activations. Strided convolutions (stride 2) roughly halve the spatial resolution at each layer. The output is flattened and projected to `obs_embed_size` (default 1024).

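The encoder's layer-by-layer spatial sizes follow from the standard convolution output formula. The sketch below assumes 4×4 kernels, stride 2, and no padding, as in the layer listing for this encoder; the helper name `conv_out` is illustrative.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


size = 64                       # input images are 64x64
channels = [32, 64, 128, 256]   # output channels of the four layers
sizes = []
for _ in channels:
    size = conv_out(size, kernel=4, stride=2)
    sizes.append(size)

# Number of features after flattening the last feature map.
flat_features = channels[-1] * size * size
print(sizes, flat_features)  # [31, 14, 6, 2] 1024
```

The final 256 × 2 × 2 feature map flattens to 1024 values, which is why the default embedding size of 1024 needs no size-changing projection.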
```
Input: (3, 64, 64)
  └─ Conv2D(3, 32, 4×4, stride 2)    → (32, 31, 31)
  └─ Conv2D(32, 64, 4×4, stride 2)   → (64, 14, 14)
  └─ Conv2D(64, 128, 4×4, stride 2)  → (128, 6, 6)
  └─ Conv2D(128, 256, 4×4, stride 2) → (256, 2, 2)
  └─ Flatten → Linear(1024, embed_size)
Output: embed_size-d vector
```

### CNN Decoder (`world_models.vision.dreamer_decoder.ConvDecoder`)

Transposed-CNN structure that upsamples the latent state back to a 64×64 image. The sizes below assume no padding, so each layer outputs `(in - 1) · stride + kernel` along each spatial axis:

```
Input: stoch + deter state (e.g. 230-d with the V1 defaults, 30 + 200)
  └─ Linear(230, 1024) → reshape to (1024, 1, 1)
  └─ ConvT2D(1024, 128, 5×5, stride 2) → (128, 5, 5)
  └─ ConvT2D(128, 64, 5×5, stride 2)   → (64, 13, 13)
  └─ ConvT2D(64, 32, 6×6, stride 2)    → (32, 30, 30)
  └─ ConvT2D(32, 3, 6×6, stride 2)     → (3, 64, 64)
Output: reconstructed image
```

### Reward and Discount Heads (`DenseDecoder`)

Two-layer MLPs with ELU activations that predict the scalar reward and, in V2, the episode discount (continuation probability). The discount head is trained with binary cross-entropy:

```{math}
\mathcal{L}_{\text{disc}} = \text{BCE}(\gamma_t, \hat{\gamma}_t)
```

### Action Decoder (`ActionDecoder`)

Outputs the policy distribution over actions. For continuous actions, it predicts a tanh-squashed Gaussian. For discrete actions, it predicts a categorical distribution. The policy is trained with REINFORCE gradients on trajectories imagined by the world model.

## Shared Training

### Training Loop

DreamerAgent follows a cyclic training loop:

1. **Collect**: Interact with the environment using the current policy (plus exploration noise). Store experience in `ReplayBuffer`.
2. **Train world model** (every step): Sample a batch of `batch_size` sequences of length `train_seq_len` from the buffer. Update the encoder, RSSM, decoder, reward head, and discount head.
3. **Train actor-critic** (every step after `seed_steps`): Imagine `imagine_horizon` steps using the prior dynamics. Compute λ-returns. Update the actor and critic.
4. **Log**: Metrics, video reconstructions, and checkpointing.

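The λ-returns used in the actor-critic step can be computed with a single backward pass over the imagined rollout. The following pure-Python sketch is illustrative only: the function name `lambda_returns` and the convention that `next_values[t]` holds `V(s_{t+1})`, with the last entry acting as the bootstrap value, are assumptions and not the library's API.

```python
def lambda_returns(rewards, next_values, discounts, lam=0.95):
    """Backward recursion for the lambda-return over a rollout of length T.

    rewards[t]     : r_t
    next_values[t] : critic estimate V(s_{t+1}); last entry is the bootstrap
    discounts[t]   : fixed gamma (V1) or predicted gamma_hat_t (V2)
    """
    T = len(rewards)
    returns = [0.0] * T
    # The final step bootstraps from the critic alone.
    returns[-1] = rewards[-1] + discounts[-1] * next_values[-1]
    for t in reversed(range(T - 1)):
        # Blend the one-step value with the longer return from t + 1.
        mix = (1.0 - lam) * next_values[t] + lam * returns[t + 1]
        returns[t] = rewards[t] + discounts[t] * mix
    return returns


# lam=1 with zero values reduces to a plain discounted sum of rewards.
mc = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.5], lam=1.0)
# lam=0 reduces to one-step TD targets r_t + gamma * V(s_{t+1}).
td = lambda_returns([1.0, 2.0], [10.0, 20.0], [0.5, 0.5], lam=0.0)
print(mc, td)  # [1.75, 1.5, 1.0] [6.0, 12.0]
```

Passing a per-step list for `discounts` lets the same function cover both the fixed `γ` of V1 and the predicted `γ̂_t` of V2.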
```{math}
\begin{aligned}
&\text{for each environment step:} \\
&\quad \text{collect } (x_t, a_t, r_t, \gamma_t) \\
&\quad \text{if } \text{step} > \text{seed\_steps}: \\
&\qquad \text{sample batch from buffer} \\
&\qquad \text{update world model (encoder, RSSM, decoder, reward)} \\
&\qquad \text{imagine } H \text{ steps} \\
&\qquad \text{update actor, critic} \\
&\quad \text{log every } \text{log\_every} \text{ steps}
\end{aligned}
```

## Usage in TorchWM

### Using config directly

```python
from torchwm import DreamerAgent, DreamerConfig

cfg = DreamerConfig()
cfg.env_backend = "dmc"
cfg.env = "walker-walk"
cfg.total_steps = 5_000_000

agent = DreamerAgent(cfg)
agent.train()
```

### Default hyperparameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `stoch_size` | 30 | Stochastic latent dimensions |
| `deter_size` | 200 | Deterministic hidden size |
| `embed_size` | 1024 | Encoder embedding size |
| `imagine_horizon` | 15 | Imagination rollout length |
| `discount` | 0.99 | Discount factor γ |
| `td_lambda` | 0.95 | λ-return parameter |
| `kl_loss_coeff` | 1.0 | KL divergence weight |

### Learning Objectives

**World Model Loss**:

```{math}
\begin{aligned}
\mathcal{L}_\mathrm{world} &= \mathcal{L}_\mathrm{reconstruction} + \mathcal{L}_\mathrm{reward} + \beta \cdot \mathcal{L}_\mathrm{KL}
\end{aligned}
```

**Actor Loss** (REINFORCE):

```{math}
\mathcal{L}_\mathrm{actor} = -\mathbb{E}\left[\log \pi(\mathbf{a} \mid \mathbf{s}) \cdot (G - V(\mathbf{s}))\right]
```

**Critic Loss** (MSE):

```{math}
\mathcal{L}_\mathrm{critic} = \mathbb{E}[(G - V(\mathbf{s}))^2]
```

## DreamerV2 Enhancements

DreamerV2 introduces several improvements:

1. **Discrete latents**: Categorical latent variables instead of Gaussian
2. **KL balancing**: Separate weighting for prior/posterior KL
3. **Discount model**: Learns to predict episode termination
4. **Layer normalization**: More stable training

## Configuration and Checkpoints

Dreamer configs are serializable, so experiments can be reproduced from the YAML saved with each run or checkpoint:

```python
from world_models.configs import DreamerConfig
from world_models.models import DreamerAgent

cfg = DreamerConfig()
cfg.env = "walker-walk"
cfg.to_yaml("configs/dreamer_walker.yaml")

agent = DreamerAgent.from_config("configs/dreamer_walker.yaml", seed=7)
agent.train()

# Checkpoints save `config.yaml` beside the weights automatically.
agent.dreamer.save("runs/walker/ckpts/model.pt")

restored = DreamerAgent.from_pretrained("runs/walker/ckpts")
print(restored.summary()["total_parameters"])
```

For lower-level workflows, the core `Dreamer` class also supports `Dreamer.from_config(...)`, `Dreamer.from_pretrained(...)`, `Dreamer.summary()`, and `Dreamer.parameter_count()`.

## Environment Support

Dreamer supports multiple backends:

```python
cfg = DreamerConfig()

# Choose one backend; the groups below are alternatives, not a sequence.
cfg.env_backend = "dmc"  # DeepMind Control Suite
cfg.env = "walker-walk"

cfg.env_backend = "gym"  # Gym/Gymnasium
cfg.env = "Pendulum-v1"

cfg.env_backend = "dmlab"  # DeepMind Lab
cfg.env = "rooms_collect_good_objects_train"
cfg.dmlab_action_repeat = 4

cfg.env_backend = "mujoco"  # MuJoCo
cfg.env = "Humanoid-v4"

cfg.env_backend = "brax"  # JAX/Brax
cfg.env = "ant"

cfg.env_backend = "procgen"  # Procgen
cfg.env = "coinrun"

cfg.env_backend = "unity_mlagents"  # Unity ML-Agents
cfg.unity_file_name = "env.exe"
```

### CLI

```bash
torchwm train dreamer --env dmc/walker-walk --device cuda
```

## Config Reference

All configuration is in `world_models.configs.dreamer_config.DreamerConfig`:

```python
from world_models.configs.dreamer_config import DreamerConfig

config = DreamerConfig()

# Dreamer version
config.algo = "Dreamerv1"  # or "Dreamerv2" (default: "Dreamerv1")

# Environment
config.env_backend = "dmc"
config.env = "walker-walk"
config.image_size = (64, 64)

# Model architecture
config.stoch_size = 30
config.deter_size = 200
config.obs_embed_size = 1024

# Training
config.total_steps = 5_000_000
config.batch_size = 50
config.train_seq_len = 50
config.imagine_horizon = 15
config.model_learning_rate = 6e-4

# Actor-critic
config.actor_learning_rate = 8e-5
config.value_learning_rate = 8e-5
config.discount = 0.99
config.td_lambda = 0.95

# KL (V2)
config.kl_alpha = 0.8
config.free_nats = 3.0

# Exploration
config.action_noise = 0.3

# Logging
config.scalar_freq = 10_000
config.checkpoint_interval = 100_000
config.enable_wandb = False
```

### Key Hyperparameters

#### World Model

| Parameter | V1 Default | V2 Default | Effect |
|-----------|------------|------------|--------|
| `stoch_size` | 30 | 32 × 32 classes | Total stochastic capacity |
| `deter_size` | 200 | 200 | GRU hidden size |
| `model_learning_rate` | 6e-4 | 3e-4 | World model learning rate |
| `train_seq_len` | 50 | 50 | Sequence length per batch |
| `batch_size` | 50 | 16 | Sequences per batch |
| `free_nats` | 3.0 | 3.0 | KL free nats threshold |

#### Actor-Critic

| Parameter | V1 Default | V2 Default | Effect |
|-----------|------------|------------|--------|
| `actor_learning_rate` | 8e-5 | 8e-5 | Policy learning rate |
| `value_learning_rate` | 8e-5 | 8e-5 | Critic learning rate |
| `imagine_horizon` | 15 | 15 | Imagination rollout length |
| `discount` | 0.99 | 0.997 | Discount factor |
| `td_lambda` | 0.95 | 0.95 | λ-return parameter |
| `kl_loss_coeff` | 1.0 | 1.0 | KL loss coefficient |
| `kl_alpha` | n/a | 0.8 | KL balancing weight (V2 only) |

#### Environment Interaction

| Parameter | Default | Effect |
|-----------|---------|--------|
| `action_repeat` | 2 | Repeat each action N times |
| `action_noise` | 0.3 | Exploration noise std |
| `seed_steps` | 5000 | Random steps before training |
| `total_steps` | 5e6 | Total environment steps |
| `collect_steps` | 1000 | Steps between model updates |

## Common Pitfalls

### Posterior collapse

If the stochastic state is ignored by the dynamics, the model reduces to a deterministic RNN. Symptoms: good reconstruction but imagination diverges.

**Fixes:**

- Increase `kl_loss_coeff` or adjust `kl_alpha` (V2)
- Decrease `free_nats`
- Reduce `stoch_size`

### Imagination divergence

The prior predicts states that drift from realistic latents over long horizons.

**Fixes:**

- Keep `imagine_horizon` short (10–15)
- Verify multi-step prediction, not just one-step

### NaN loss during training

**Fixes:**

- Reduce `model_learning_rate` to 1e-4
- Tighten gradient clipping (default 100 → 10)
- Enable layer norm

### Actor never improves

**Fixes:**

- Increase `imagine_horizon` for delayed rewards
- Increase exploration noise via `action_noise`
- Verify critic loss is decreasing

## References

- Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2019). Dream to Control: Learning Behaviors by Latent Imagination. arXiv:1912.01603.
- Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with Discrete World Models (DreamerV2). *ICLR 2021.* arXiv:2010.02193.