# Dreamer: Model-Based RL with Latent Dynamics
Dreamer is a model-based reinforcement learning algorithm that learns a latent dynamics model from images and trains a behavior policy entirely in the latent space.
Based on the papers:

- Dream to Control: Learning Behaviors by Latent Imagination (Dreamer, Hafner et al., 2019)
- Mastering Atari with Discrete World Models (DreamerV2, Hafner et al., 2020)
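## Key Idea

Dreamer learns three components:

- World model: a latent dynamics model that predicts future latent states and rewards
- Value model: estimates the expected return from any latent state
- Policy: an actor that chooses actions to maximize the expected return in latent space

The key innovation is learning behaviors purely in imagination: the actor and critic are trained on latent trajectories predicted by the world model, so behavior learning needs no additional environment interaction and no gradients have to pass through the real environment.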
## Architecture

Training pipeline: World Model (RSSM) → Imagination Rollout → Actor-Critic Learning
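A toy sketch of the imagination rollout stage. The scalar functions `policy`, `prior_step`, and `reward_head` are hypothetical stand-ins for the actor, the RSSM prior, and the reward head; the point is only the control flow: the policy acts on predicted latent states and the environment is never queried.

```python
from typing import Callable, List, Tuple


def imagine(
    start: float,
    policy: Callable[[float], float],
    prior_step: Callable[[float, float], float],
    reward_head: Callable[[float], float],
    horizon: int = 15,
) -> Tuple[List[float], List[float], List[float]]:
    """Roll out `policy` inside a (toy) learned model for `horizon` steps."""
    states, actions, rewards = [start], [], []
    state = start
    for _ in range(horizon):
        action = policy(state)
        # The next latent state comes from the model's prior: no observation, no env call.
        state = prior_step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward_head(state))
    return states, actions, rewards


# Hypothetical 1-D stand-ins for the learned networks.
states, actions, rewards = imagine(
    start=1.0,
    policy=lambda s: -0.5 * s,
    prior_step=lambda s, a: s + a,
    reward_head=lambda s: -s * s,
    horizon=3,
)
print(rewards)  # [-0.25, -0.0625, -0.015625]
```

In Dreamer the imagined states, actions, and rewards produced this way are what the actor and critic are trained on.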
## Components

### 1. Recurrent State Space Model (RSSM)
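The core world model combines:

- Deterministic hidden state ($h_t$): the recurrent state of a GRU
- Stochastic latent state ($s_t$): discrete or continuous latent variables

Dynamics (deterministic recurrence):

$$h_t = f_\theta(h_{t-1}, s_{t-1}, a_{t-1})$$

Posterior (representation model, conditioned on the current observation):

$$s_t \sim q_\theta(s_t \mid h_t, o_t)$$

Prior (transition model, used during imagination):

$$\hat{s}_t \sim p_\theta(\hat{s}_t \mid h_t)$$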
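A minimal NumPy sketch of one RSSM step with Gaussian (continuous) latents, as in the original Dreamer. Everything here is an assumption for illustration: the class name `TinyRSSM`, the tiny layer sizes, the ReLU input layer, the 0.1 minimum standard deviation, and the random untrained weights. It shows the data flow only; torchwm's actual module will differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the defaults in the hyperparameter table are 30 / 200 / 1024.
STOCH, DETER, ACT, EMBED = 4, 8, 2, 16
MIN_STD = 0.1  # assumed lower bound on the latent standard deviation


def linear(in_dim, out_dim):
    """Random (untrained) weight matrix and zero bias."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def apply(layer, x):
    weight, bias = layer
    return x @ weight + bias


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softplus(x):
    return np.log1p(np.exp(x))


class TinyRSSM:
    def __init__(self):
        self.inp = linear(STOCH + ACT, DETER)              # embeds (s_{t-1}, a_{t-1})
        self.gru_x = linear(DETER, 3 * DETER)              # GRU input weights
        self.gru_h = linear(DETER, 3 * DETER)              # GRU recurrent weights
        self.prior_head = linear(DETER, 2 * STOCH)         # p(s_t | h_t)
        self.post_head = linear(DETER + EMBED, 2 * STOCH)  # q(s_t | h_t, o_t)

    def gru(self, x, h):
        """Standard GRU cell update (reset gate r, update gate z)."""
        xr, xz, xn = np.split(apply(self.gru_x, x), 3)
        hr, hz, hn = np.split(apply(self.gru_h, h), 3)
        r = sigmoid(xr + hr)
        z = sigmoid(xz + hz)
        n = np.tanh(xn + r * hn)
        return (1.0 - z) * n + z * h

    def gaussian(self, head, x):
        """Diagonal Gaussian: returns (mean, std, reparameterized sample)."""
        mean, raw_std = np.split(apply(head, x), 2)
        std = softplus(raw_std) + MIN_STD
        return mean, std, mean + std * rng.standard_normal(mean.shape)

    def step(self, h_prev, s_prev, a_prev, embed=None):
        x = np.maximum(apply(self.inp, np.concatenate([s_prev, a_prev])), 0.0)
        h = self.gru(x, h_prev)                    # h_t = f(h_{t-1}, s_{t-1}, a_{t-1})
        prior = self.gaussian(self.prior_head, h)  # prior over s_t given h_t
        if embed is None:                          # imagination: no observation
            return h, prior, None
        post = self.gaussian(self.post_head, np.concatenate([h, embed]))
        return h, prior, post


model = TinyRSSM()
h, s, a = np.zeros(DETER), np.zeros(STOCH), np.zeros(ACT)
embed = rng.normal(size=EMBED)                     # stand-in for the encoder output
h, prior, post = model.step(h, s, a, embed)
print(h.shape, prior[2].shape, post[2].shape)      # (8,) (4,) (4,)
```

During training the posterior sample is carried to the next step; during imagination `embed` is omitted and the prior sample is used instead.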
### 2. Encoder/Decoder

- Encoder: CNN that maps images to latent embeddings
- Decoder: transposed CNN that reconstructs images from latent states
- Both use ReLU activations and residual connections
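Assuming the encoder's downsampling path matches the original Dreamer encoder (four 4×4 convolutions with stride 2, no padding, and 32, 64, 128, 256 channels applied to a 64×64 image), the flattened output reproduces the 1024 encoder embedding size listed under Key Hyperparameters. The sketch below only does that shape arithmetic; torchwm's actual layers, including any residual blocks, may differ.

```python
from typing import List, Sequence, Tuple


def conv_out(size: int, kernel: int = 4, stride: int = 2, padding: int = 0) -> int:
    """Output height/width of a 2-D convolution (dilation 1), as in torch.nn.Conv2d."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_embedding_size(
    image_size: int = 64, depths: Sequence[int] = (32, 64, 128, 256)
) -> Tuple[List[int], int]:
    """Spatial size after each conv layer and the flattened embedding size."""
    size, sizes = image_size, []
    for _ in depths:
        size = conv_out(size)
        sizes.append(size)
    return sizes, size * size * depths[-1]


sizes, embed = encoder_embedding_size()
print(sizes, embed)  # [31, 14, 6, 2] 1024
```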
### 3. Reward/Discount Heads

- Reward model: predicts the reward from the latent state
- Discount model: predicts episode termination (DreamerV2)
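In DreamerV2 the discount head acts as a Bernoulli predictor of whether the episode continues, and the actor and critic losses of each imagined step are weighted by the cumulative product of predicted discounts. A pure-Python sketch, assuming the head outputs a logit and that the per-step discount is γ times the continuation probability; `predicted_discounts` and `cumulative_weights` are hypothetical names.

```python
import math
from typing import List

GAMMA = 0.99  # discount factor default from the hyperparameter table


def predicted_discounts(continue_logits: List[float], gamma: float = GAMMA) -> List[float]:
    """Map Bernoulli 'episode continues' logits to per-step discounts gamma * p(continue)."""
    return [gamma / (1.0 + math.exp(-logit)) for logit in continue_logits]


def cumulative_weights(discounts: List[float]) -> List[float]:
    """Weight of each imagined step: product of the discounts before it (first step gets 1)."""
    weights, running = [], 1.0
    for d in discounts:
        weights.append(running)
        running *= d
    return weights


# A confident "continue", an uncertain step, then a predicted termination.
discounts = predicted_discounts([10.0, 0.0, -10.0])
weights = cumulative_weights(discounts)
print(discounts, weights)
```

Steps that follow a predicted termination receive almost no weight, which softly accounts for episode ends inside imagination.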
## Training

```python
from torchwm import DreamerAgent, DreamerConfig

cfg = DreamerConfig()
cfg.env_backend = "gym"        # Gym/Gymnasium backend
cfg.env = "Pendulum-v1"        # environment id
cfg.total_steps = 1_000_000    # training budget in steps

agent = DreamerAgent(cfg)
agent.train()
```
## Key Hyperparameters

| Description | Default |
|---|---|
| Stochastic latent dimensions | 30 |
| Deterministic hidden size | 200 |
| Encoder embedding size | 1024 |
| Imagination rollout length (horizon) | 15 |
| Discount factor γ | 0.99 |
| λ-return parameter λ | 0.95 |
| KL divergence weight | 1.0 |
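To see how the imagination horizon and λ interact, the λ-return from the first imagined step can be written, following the Dreamer paper, as a weighted mix of n-step returns: each n-step return with n < H gets weight (1 − λ)λ^(n−1), and the full H-step return (bootstrapped by the critic) gets λ^(H−1). A small sketch with the table defaults (λ = 0.95, horizon 15); with these values roughly half of the weight (about 0.49) falls on the full 15-step return. `lambda_weights` is a hypothetical helper.

```python
from typing import List


def lambda_weights(lam: float = 0.95, horizon: int = 15) -> List[float]:
    """Weights the λ-return puts on the n-step returns from the first imagined step.

    n-step returns for n = 1 .. horizon-1 get (1 - lam) * lam**(n - 1);
    the full-horizon return gets the remaining mass lam**(horizon - 1).
    """
    weights = [(1 - lam) * lam ** (n - 1) for n in range(1, horizon)]
    weights.append(lam ** (horizon - 1))
    return weights


w = lambda_weights()
print(round(w[0], 4), round(w[-1], 4), round(sum(w), 6))
```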
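## Learning Objectives

World model loss (image, reward, and discount log-likelihoods plus a KL term scaled by the KL divergence weight $\beta$):

$$\mathcal{L}_{\text{model}} = \mathbb{E}_{q}\Big[\sum_{t} -\ln p_\theta(o_t \mid h_t, s_t) - \ln p_\theta(r_t \mid h_t, s_t) - \ln p_\theta(\gamma_t \mid h_t, s_t) + \beta\,\mathrm{KL}\big[q_\theta(s_t \mid h_t, o_t) \,\big\|\, p_\theta(s_t \mid h_t)\big]\Big]$$

Actor and critic both use λ-returns computed on imagined trajectories of length $H$, where $\hat{r}_t$ and $\hat{\gamma}_t$ are the predicted reward and discount and $v$ is the critic:

$$V^\lambda_t = \hat{r}_t + \hat{\gamma}_t\big((1-\lambda)\,v(\hat{s}_{t+1}) + \lambda V^\lambda_{t+1}\big), \qquad V^\lambda_H = v(\hat{s}_H)$$

Actor loss (REINFORCE with an entropy bonus scaled by $\eta$; $\mathrm{sg}$ denotes stop-gradient):

$$\mathcal{L}_{\text{actor}} = \mathbb{E}\Big[\sum_{t=0}^{H-1} -\ln \pi(\hat{a}_t \mid \hat{s}_t)\,\mathrm{sg}\big(V^\lambda_t - v(\hat{s}_t)\big) - \eta\,\mathcal{H}\big[\pi(\cdot \mid \hat{s}_t)\big]\Big]$$

Critic loss (MSE):

$$\mathcal{L}_{\text{critic}} = \mathbb{E}\Big[\sum_{t=0}^{H-1} \tfrac{1}{2}\big(v(\hat{s}_t) - \mathrm{sg}(V^\lambda_t)\big)^2\Big]$$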
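The λ-return targets for the actor and critic can be computed in one backward pass over an imagined trajectory, using the recursion V^λ_t = r_t + γ_t((1 − λ) v(s_{t+1}) + λ V^λ_{t+1}) with V^λ_H = v(s_H). A pure-Python sketch, assuming H predicted rewards and discounts plus H + 1 critic values (the last one bootstraps beyond the horizon); `lambda_returns` and `advantages` are hypothetical names.

```python
from typing import List


def lambda_returns(
    rewards: List[float],
    values: List[float],
    discounts: List[float],
    lam: float = 0.95,
) -> List[float]:
    """Compute V^λ_t for t = 0 .. H-1 of one imagined trajectory.

    rewards[t], discounts[t]: predicted r_t and γ_t for t = 0 .. H-1
    values[t]: critic estimate v(s_t) for t = 0 .. H (H + 1 entries)
    """
    horizon = len(rewards)
    if len(discounts) != horizon or len(values) != horizon + 1:
        raise ValueError("expected H rewards, H discounts and H + 1 values")
    returns = [0.0] * horizon
    next_return = values[horizon]  # V^λ_H = v(s_H): bootstrap from the critic
    for t in reversed(range(horizon)):
        next_return = rewards[t] + discounts[t] * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


def advantages(returns: List[float], values: List[float]) -> List[float]:
    """REINFORCE weights V^λ_t - v(s_t), treated as constants (stop-gradient)."""
    return [ret - val for ret, val in zip(returns, values)]


rewards, values, discounts = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 10.0], [0.5, 0.5, 0.5]
print(lambda_returns(rewards, values, discounts, lam=1.0))  # [3.0, 4.0, 6.0]
print(lambda_returns(rewards, values, discounts, lam=0.0))  # [1.0, 1.0, 6.0]
```

With λ = 1 the target is the discounted reward sum bootstrapped only at the horizon; with λ = 0 it reduces to one-step TD targets. The default λ = 0.95 sits between the two.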
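## DreamerV2 Enhancements

DreamerV2 introduces several improvements:

- Discrete latents: categorical latent variables (32 categoricals with 32 classes each in the paper) instead of Gaussians, trained with straight-through gradients
- KL balancing: the prior and posterior sides of the KL loss are weighted separately, so the prior is trained toward the posterior faster than the posterior is regularized toward the prior
- Discount model: learns to predict episode termination
- Layer normalization: more stable training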
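KL balancing can be written with a stop-gradient operator sg(·) and a balancing coefficient α (0.8 in the DreamerV2 paper; torchwm's default is an assumption here). Both terms have the same value, KL[q ‖ p], so only the gradients differ: a fraction α of the gradient trains the prior toward the posterior, and 1 − α regularizes the posterior toward the prior.

```latex
\mathcal{L}_{\mathrm{KL}} =
  \alpha \, \mathrm{KL}\big[\mathrm{sg}\big(q(s_t \mid h_t, o_t)\big) \,\big\|\, p(s_t \mid h_t)\big]
  + (1 - \alpha) \, \mathrm{KL}\big[q(s_t \mid h_t, o_t) \,\big\|\, \mathrm{sg}\big(p(s_t \mid h_t)\big)\big]
```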
## Environment Support
Dreamer supports multiple backends:
```python
from torchwm import DreamerConfig

# Each snippet is an alternative: configure one backend per DreamerConfig.

# DeepMind Control Suite
cfg = DreamerConfig()
cfg.env_backend = "dmc"
cfg.env = "walker-walk"

# Gym/Gymnasium
cfg = DreamerConfig()
cfg.env_backend = "gym"
cfg.env = "Pendulum-v1"

# MuJoCo: Gymnasium task ids or native MJCF/MJB files
cfg = DreamerConfig()
cfg.env_backend = "mujoco"
cfg.env = "Humanoid-v4"      # or a native model file such as "models/cartpole.xml"
cfg.mujoco_camera = None     # native MJCF/MJB only
cfg.mujoco_frame_skip = 4    # native MJCF/MJB only

# Gymnasium Robotics: any id registered by the installed package
cfg = DreamerConfig()
cfg.env_backend = "robotics"
cfg.env = "HalfCheetah-v2"

# Brax (JAX)
cfg = DreamerConfig()
cfg.env_backend = "brax"
cfg.env = "ant"
cfg.brax_backend = "generalized"

# Unity ML-Agents
cfg = DreamerConfig()
cfg.env_backend = "unity_mlagents"
cfg.unity_file_name = "env.exe"
```
For MuJoCo tasks, Dreamer delegates adapter construction to `make_mujoco_env_from_config`. This keeps `make_env` focused on backend selection, while the MuJoCo module decides between task-id and XML/MJB sources. Use Gymnasium task ids such as `Humanoid-v4` for standard benchmark rewards, or use native MJCF/MJB sources plus `MuJoCoImageEnv` callbacks for custom rewards and termination logic. Legacy MuJoCo v2/v3 ids and other Gymnasium Robotics tasks can use `env_backend="robotics"`; TorchWM lists those ids dynamically from the installed `gymnasium-robotics` package.
## References

- Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2019). Dream to Control: Learning Behaviors by Latent Imagination.
- Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2020). Mastering Atari with Discrete World Models.