Genie: Generative Interactive Environment
Genie is a generative model trained from video-only data that can be used as an
interactive environment for reinforcement learning and decision-making tasks,
without requiring any action labels.
Based on the paper: Genie: Generative Interactive Environments (Bruce et al., 2024)
Genie learns world dynamics from unlabeled videos through three components:
Video Tokenization: Converts raw video frames into discrete tokens
Latent Actions: Infers the underlying actions that caused transitions between frames
Dynamics Prediction: Predicts future frames given past frames and latent actions
This enables agents to imagine and plan in a learned latent action space without
needing explicit action labels.
graph TD
subgraph "Genie"
J["Video frames"] --> K["Video tokenizer"]
K --> L["Video tokens"]
M["Frame pairs (xₜ, xₜ₊₁)"] --> N["Latent action model"]
N --> O["Latent action âₜ"]
L --> P["Dynamics model"]
O --> P
P --> Q["Next video tokens"]
Q --> K --> R["Interactive generation"]
end
Genie Architecture
Video frames → Video tokenizer → Video tokens → Dynamics model → Decoded frames
Consecutive frames → Latent action model → Latent actions → Dynamics model
The video tokenizer converts raw video frames into discrete tokens using a
VQ-VAE approach with spatio-temporal downsampling:
Input: (3, 16, 64, 64) video clip
  └─ 3D convolutions (spatio-temporal downsampling)
  └─ VQ layer (codebook size: 1024)
Output: (16, 16, 16) discrete token grid
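Each frame is downsampled 4× spatially, so the total tokens per frame is
(64/4) × (64/4) = 16 × 16 = 256, with one 16 × 16 grid for each of the 16
frames in the output shape above.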
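The shape arithmetic above can be checked with a small helper. This is a minimal sketch, not part of torchwm: the function name `token_grid_shape` and the parameters `spatial_factor` and `temporal_factor` are hypothetical, and the default temporal factor of 1 is an assumption read off the shapes quoted above (16 frames in, 16 token grids out).

```python
def token_grid_shape(num_frames, image_size, spatial_factor=4, temporal_factor=1):
    """Return (T', H', W') for the discrete token grid of one clip.

    spatial_factor and temporal_factor are illustrative assumptions,
    chosen to reproduce the (16, 16, 16) grid quoted in the text.
    """
    if image_size % spatial_factor or num_frames % temporal_factor:
        raise ValueError("input size must be divisible by the downsampling factor")
    side = image_size // spatial_factor
    return (num_frames // temporal_factor, side, side)


shape = token_grid_shape(num_frames=16, image_size=64)
tokens_per_frame = shape[1] * shape[2]
print(shape, tokens_per_frame)  # (16, 16, 16) 256
```

Changing `image_size` or the factors shows how quickly the token count, and with it the sequence length seen by the dynamics model, grows with resolution.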
The latent action model (LAM) learns to infer discrete latent actions from
frame-to-frame transitions without any supervision:
Input: frame_t, frame_t+1
  └─ Encoder: process both frames
  └─ VQ layer: quantize to action token
Output: latent action index (e.g., {0, ..., 7})
Training loss:
\[\mathcal{L}_{\text{LAM}} =
\underbrace{\|x_{t+1} - \hat{x}_{t+1}(x_t, \hat{a}_t)\|^2}_{\text{reconstruction}}
+ \underbrace{\|\text{sg}[z_e] - e_k\|^2}_{\text{codebook}}
+ \beta \cdot \underbrace{\|z_e - \text{sg}[e_k]\|^2}_{\text{commitment}}\]
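Here sg[·] denotes the stop-gradient operator, z_e is the encoder output, e_k is
the nearest codebook vector, and β weights the commitment term.
The key insight: the action that best explains the frame transition is the one
that minimizes the reconstruction error of the next frame.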
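To make the three loss terms concrete, here is a minimal forward-pass sketch in plain Python. It is an illustration under stated assumptions, not the torchwm implementation: the helper names, the toy 2-D codebook, and β = 0.25 are all hypothetical choices. Without autodiff the stop-gradient has no effect, so the codebook and commitment terms evaluate to the same number; they differ only in which parameters receive gradients.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize(z_e, codebook):
    """Return (index, vector) of the codebook entry nearest to z_e."""
    k = min(range(len(codebook)), key=lambda i: squared_distance(z_e, codebook[i]))
    return k, codebook[k]


def lam_loss(x_next, x_next_pred, z_e, e_k, beta=0.25):
    """Forward value of reconstruction + codebook + beta * commitment."""
    reconstruction = squared_distance(x_next, x_next_pred)
    codebook_term = squared_distance(z_e, e_k)  # with autodiff: updates e_k only
    commitment = squared_distance(z_e, e_k)     # with autodiff: updates z_e only
    return reconstruction + codebook_term + beta * commitment


# Toy example: four candidate latent actions in a 2-D embedding space.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_e = [0.9, 0.2]                      # encoder output for one frame pair
k, e_k = quantize(z_e, codebook)      # k is the latent action index
loss = lam_loss([1.0, 2.0], [1.0, 1.5], z_e, e_k)
print(k, round(loss, 4))
```

In a real model the vectors would be tensors and the reconstruction would come from a decoder conditioned on the quantized action.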
The dynamics model is a Transformer that predicts future video tokens
conditioned on past tokens and latent actions:
Input: past video tokens + latent action
  └─ Transformer (causal masking)
  └─ Token prediction head
Output: next video tokens (as logits)
Training loss: Cross-entropy on predicted vs. actual tokens.
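As a rough illustration of this objective, the sketch below computes cross-entropy over masked token positions only, using plain Python lists. The function name and the toy logits are hypothetical; a real implementation would operate on batched tensors.

```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over positions where mask is True.

    logits:  list of per-position score lists (one score per vocabulary entry)
    targets: list of ground-truth token indices
    mask:    list of bools, True where the token must be predicted
    """
    total, count = 0.0, 0
    for scores, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]  # negative log-probability of target
        count += 1
    return total / max(count, 1)


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
targets = [0, 2, 1]
mask = [True, True, False]  # the third position was visible, so it is skipped
loss = masked_cross_entropy(logits, targets, mask)
print(round(loss, 4))
```

With uniform logits the loss at a position equals the log of the vocabulary size, which is a handy sanity check at the start of training.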
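During generation, the dynamics model uses MaskGIT sampling, an iterative
refinement strategy that predicts all masked tokens in parallel at each step and
therefore needs far fewer forward passes than token-by-token autoregressive
decoding: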
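# MaskGIT sampling (pseudocode; maskgit_steps = 25 by default)
tokens = all_mask_tokens   # every position starts as the mask token
mask = all_masked          # marks positions that still need a prediction
for step in range(maskgit_steps):
    logits = dynamics_model(tokens, mask, latent_action)
    tokens = sample_top_k(logits, mask)   # sample tokens at masked positions
    mask = update_mask(step)              # gradually unmask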
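The loop above can be made runnable with a stand-in model. The following is a minimal sketch under several assumptions: `predict_fn` is a hypothetical callable returning per-position probabilities (in practice it would wrap the dynamics model together with the past tokens and latent action), the cosine masking schedule follows the MaskGIT paper, and the temperature-annealed noise MaskGIT adds to the confidence scores is omitted for brevity.

```python
import math
import random


def maskgit_sample(predict_fn, num_tokens, steps=25, seed=0):
    """Iteratively fill a token sequence, committing the most confident tokens first."""
    rng = random.Random(seed)
    tokens = [None] * num_tokens  # None marks a still-masked position
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok is None]
        if not masked:
            break
        probs = predict_fn(tokens)  # one probability list per position
        # Sample a candidate for every masked position; its probability is the confidence.
        candidates = {}
        for i in masked:
            tok = rng.choices(range(len(probs[i])), weights=probs[i])[0]
            candidates[i] = (tok, probs[i][tok])
        # Cosine schedule: number of positions that remain masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        keep_masked = min(keep_masked, len(masked) - 1)  # always commit at least one
        by_confidence = sorted(masked, key=lambda i: candidates[i][1], reverse=True)
        for i in by_confidence[: len(masked) - keep_masked]:
            tokens[i] = candidates[i][0]
    return tokens


def uniform_model(tokens, vocab_size=8):
    """Stand-in for a trained model: every token is equally likely."""
    return [[1.0 / vocab_size] * vocab_size for _ in tokens]


sampled = maskgit_sample(uniform_model, num_tokens=16, steps=4)
print(sampled)
```

Because the schedule reaches zero at the final step, every position is committed by the time the loop ends, regardless of how many steps are used.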
Genie is trained with multiple loss components:
Tokenizer Loss: VQ-VAE reconstruction loss for video tokenization
Latent Action Loss: VQ commitment loss + prediction loss for action learning
Dynamics Loss: Cross-entropy for token prediction with masking
Total Loss = L_tokenizer + λ₁·L_action + λ₂·L_dynamics
from torchwm import GenieConfig, create_genie_small

cfg = GenieConfig()
cfg.num_frames = 16
cfg.image_size = 64
cfg.epochs = 100

model = create_genie_small(num_frames=16, image_size=64)
Generate new video frames from a prompt frame:
import torch

prompt_frame = torch.randn(1, 3, 64, 64)  # (batch, channels, height, width); random placeholder
generated = model.generate(prompt_frame, num_frames=16)
Step through the environment using inferred or specified actions:
current_frame = torch.randn(1, 3, 64, 64)  # random placeholder frame
action = torch.tensor([3])  # Latent action index (0-7 when action_vocab_size = 8)
next_frame = model.play(current_frame, action)
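Stepping repeatedly turns the single call above into a trajectory. The sketch below shows one way to structure such a loop; `rollout` and `toy_step` are hypothetical names, and the toy step function only stands in for a trained model so the example runs on its own. With a real model, `step_fn` would wrap a call such as `model.play(frame, action)`.

```python
def rollout(step_fn, first_frame, actions):
    """Apply step_fn(frame, action) once per action and collect every frame."""
    frames = [first_frame]
    for action in actions:
        frames.append(step_fn(frames[-1], action))
    return frames


def toy_step(frame, action):
    """Stand-in dynamics: the 'frame' is a number and the action is added to it."""
    return frame + action


trajectory = rollout(toy_step, first_frame=0, actions=[3, 1, 0, 2])
print(trajectory)  # [0, 3, 4, 4, 6]
```

Keeping the step function as a parameter makes it easy to swap in a scripted policy, user input, or actions inferred from a reference video.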
Infer latent actions from real video frames:
frames = torch.randn(1, 3, 16, 64, 64)  # (batch, channels, frames, height, width)
actions = model.infer_actions(frames)
torchwm train genie --config path/to/genie_config.yaml
from torchwm import GenieConfig, create_genie_small

cfg = GenieConfig()
cfg.num_frames = 16
cfg.image_size = 64
cfg.epochs = 100

# Create model
model = create_genie_small(num_frames=16, image_size=64)

# Key hyperparameters
cfg.tokenizer_vocab_size = 1024  # Video token codebook
cfg.action_vocab_size = 8        # Latent action codebook
cfg.dynamics_dim = 512           # Transformer hidden size
cfg.dynamics_depth = 8           # Transformer layers
cfg.maskgit_steps = 25           # MaskGIT refinement steps
Video-only training: No action labels required
Interactive: Can be used as a simulated environment
Generalizable: Learns from diverse video data
Latent action space: Enables efficient planning
Codebook collapse: most codebook entries go unused.
Fixes:
Use EMA codebook updates (default in Genie)
Lower commitment loss weight
Increase codebook dimension
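Before applying these fixes, it helps to measure how much of the codebook is actually in use. The sketch below is a hypothetical diagnostic, not a torchwm function: it takes a flat list of token (or latent action) indices and reports the fraction of entries used and the perplexity of the usage distribution.

```python
import math
from collections import Counter


def codebook_usage(indices, codebook_size):
    """Return (fraction of entries used, perplexity of the usage distribution)."""
    if not indices:
        raise ValueError("indices must not be empty")
    counts = Counter(indices)
    total = len(indices)
    used_fraction = len(counts) / codebook_size
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return used_fraction, math.exp(entropy)


# Toy batch: only 2 of 8 entries appear, and one of them dominates.
used, perplexity = codebook_usage([0, 0, 0, 1, 0, 0, 1, 0], codebook_size=8)
print(used, round(perplexity, 3))
```

A perplexity close to the codebook size indicates even usage, while a value near 1 means a single entry dominates. The same check applied to latent action indices can flag a LAM that has fallen back to one trivial action.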
The LAM might learn trivial actions.
Fixes:
Bruce, J., et al. (2024). Genie: Generative Interactive Environments. arXiv:2402.15391.
van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning. NeurIPS 2017.
Chang, H., et al. (2022). MaskGIT: Masked Generative Image Transformer. CVPR 2022.