Genie: Generative Interactive Environment #

Genie is a generative model trained from video-only data that can be used as an interactive environment for reinforcement learning and decision-making tasks, without requiring any action labels.

Based on paper: Genie: Generative Interactive Environments (Bruce et al., 2024)

Overview #

Genie learns world dynamics from unlabeled videos through three components:

Video Tokenization: Converts raw video frames into discrete tokens
Latent Actions: Infers the underlying actions that caused transitions between frames
Dynamics Prediction: Predicts future frames given past frames and latent actions

This enables agents to imagine and plan in a learned latent action space without needing explicit action labels.

graph TD
    subgraph "Genie"
        J["Video frames"] --> K["Video tokenizer"]
        K --> L["Video tokens"]
        M["Frame pairs (xₜ, xₜ₊₁)"] --> N["Latent action model"]
        N --> O["Latent action âₜ"]
        L --> P["Dynamics model"]
        O --> P
        P --> Q["Next video tokens"]
        Q --> K --> R["Interactive generation"]
    end
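
The data flow in the diagram can be sketched as a rollout loop. This is a minimal, library-free illustration and not TorchWM's implementation: `rollout` and `toy_dynamics` are hypothetical names, each frame is represented directly as a short list of token ids (so the tokenizer step is skipped), and the toy dynamics just shifts every token by the action index so the loop is easy to trace by hand.

```python
from typing import Callable, List

Frame = List[int]  # one frame as a flat list of discrete token ids


def toy_dynamics(history: List[Frame], action: int, vocab_size: int = 1024) -> Frame:
    """Hypothetical stand-in for the dynamics model: shift the last frame's tokens."""
    return [(tok + action) % vocab_size for tok in history[-1]]


def rollout(prompt: Frame, actions: List[int],
            dynamics: Callable[[List[Frame], int], Frame]) -> List[Frame]:
    """Generate one new frame per action, feeding each prediction back as context."""
    history = [prompt]
    for action in actions:
        history.append(dynamics(history, action))
    return history


frames = rollout(prompt=[0, 1, 2, 3], actions=[1, 0, 3], dynamics=toy_dynamics)
print(frames)
```

The point of the sketch is the feedback structure: every generated frame is appended to the history that conditions the next prediction, and the only external input per step is one latent action index.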

\[\mathcal{L}_{\text{LAM}} = \underbrace{\|x_{t+1} - \hat{x}_{t+1}(x_t, \hat{a}_t)\|^2}_{\text{reconstruction}} + \underbrace{\|\text{sg}[z_e] - e_k\|^2}_{\text{codebook}} + \beta \cdot \underbrace{\|z_e - \text{sg}[e_k]\|^2}_{\text{commitment}}\]

The key insight: the action that best explains the frame transition is the one that minimizes the reconstruction error of the next frame.
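
The three terms of the loss above can be evaluated numerically with a small sketch. The assumptions here are mine: the helper names are hypothetical, the nearest codebook entry is chosen by squared Euclidean distance, and `beta=0.25` is only a commonly used VQ-VAE value, not a documented Genie default. Plain Python has no autograd, so the stop-gradient `sg[·]` appears only in comments: it changes which parameters receive gradients, not the value of each term.

```python
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(z_e: Sequence[float], codebook: List[List[float]]) -> int:
    """Index k of the codebook entry e_k closest to the encoder output z_e."""
    return min(range(len(codebook)), key=lambda k: sq_dist(z_e, codebook[k]))


def lam_loss(x_next: Sequence[float], x_pred: Sequence[float],
             z_e: Sequence[float], codebook: List[List[float]],
             beta: float = 0.25) -> Tuple[int, float]:
    """Return (chosen action index, loss value) for one frame transition."""
    k = quantize(z_e, codebook)
    e_k = codebook[k]
    reconstruction = sq_dist(x_next, x_pred)
    codebook_term = sq_dist(z_e, e_k)  # ||sg[z_e] - e_k||^2: gradient reaches e_k only
    commitment = sq_dist(z_e, e_k)     # ||z_e - sg[e_k]||^2: gradient reaches z_e only
    return k, reconstruction + codebook_term + beta * commitment


action, loss = lam_loss(
    x_next=[1.0, 0.0], x_pred=[0.5, 0.0],
    z_e=[0.9, 1.2], codebook=[[0.0, 0.0], [1.0, 1.0]],
)
print(action, loss)
```

In this example the encoder output is closest to the second codebook entry, so the inferred action index is 1, and the loss is the sum of a reconstruction error of 0.25, a codebook term of 0.05, and a commitment term weighted by `beta`.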

3. Dynamics Model #

Transformer-based model that predicts future video tokens conditioned on past tokens and latent actions:

Input:  past video tokens + latent action
  └─ Transformer (causal masking)
  └─ Token prediction head
Output: next video tokens (as logits)

Training loss: Cross-entropy on predicted vs. actual tokens.
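
As a rough illustration of that loss, the sketch below computes cross-entropy over per-position logits in plain Python. The function names and the optional `mask` argument (used to score only some positions, as in masked training) are assumptions for the example, not TorchWM's API.

```python
import math
from typing import List, Optional


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def token_cross_entropy(logits: List[List[float]], targets: List[int],
                        mask: Optional[List[bool]] = None) -> float:
    """Mean negative log-likelihood of the target tokens over scored positions."""
    if mask is None:
        mask = [True] * len(targets)
    losses = [-log_softmax(row)[tgt]
              for row, tgt, keep in zip(logits, targets, mask) if keep]
    if not losses:
        raise ValueError("mask selects no positions")
    return sum(losses) / len(losses)


# Uniform logits over a 4-token vocabulary give a loss of log(4).
uniform_loss = token_cross_entropy([[0.0] * 4] * 3, targets=[0, 1, 2])
# A confident, correct prediction gives a loss close to zero.
confident_loss = token_cross_entropy([[20.0, 0.0, 0.0, 0.0]], targets=[0])
print(uniform_loss, confident_loss)
```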

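During generation, the dynamics model uses MaskGIT sampling, an iterative refinement strategy that predicts all masked tokens in parallel at each step, so a frame needs only maskgit_steps forward passes instead of one pass per token as in autoregressive decoding:

# MaskGIT sampling pseudocode (maskgit_steps defaults to 25)
tokens = all_mask_tokens   # every position starts as the mask token
mask = all_masked          # True where a position is still undecided
for step in range(maskgit_steps):
    logits = dynamics_model(tokens, mask, latent_action)
    tokens = sample_top_k(logits, mask)  # fill in the masked positions
    mask = update_mask(step)             # gradually unmask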
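
A runnable toy version of this loop may make the unmasking schedule more concrete. Everything here is a sketch under stated assumptions: `maskgit_decode` and `toy_predict` are hypothetical names, the model is replaced by a fixed probability table, tokens are sampled from the full distribution rather than top-k, and the cosine schedule follows the MaskGIT paper rather than any confirmed TorchWM setting.

```python
import math
import random
from typing import Callable, List

MASK = -1  # sentinel for a position that has not been decoded yet


def toy_predict(tokens: List[int], vocab_size: int = 4) -> List[List[float]]:
    """Hypothetical model: position i prefers token i % vocab_size with p = 0.9."""
    probs = []
    for i in range(len(tokens)):
        row = [0.1 / (vocab_size - 1)] * vocab_size
        row[i % vocab_size] = 0.9
        probs.append(row)
    return probs


def maskgit_decode(predict: Callable[[List[int]], List[List[float]]],
                   num_tokens: int, steps: int, rng: random.Random) -> List[int]:
    """Iteratively fill masked positions, keeping the most confident samples first."""
    tokens = [MASK] * num_tokens
    for step in range(steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        probs = predict(tokens)
        # Sample a candidate token for every masked position and record its probability.
        candidates = {}
        for i in masked:
            tok = rng.choices(range(len(probs[i])), weights=probs[i])[0]
            candidates[i] = (probs[i][tok], tok)
        # Cosine schedule: how many positions should stay masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        num_unmask = max(1, len(masked) - keep_masked)
        ranked = sorted(masked, key=lambda i: candidates[i][0], reverse=True)
        for i in ranked[:num_unmask]:
            tokens[i] = candidates[i][1]
    return tokens


decoded = maskgit_decode(toy_predict, num_tokens=16, steps=8, rng=random.Random(0))
print(decoded)
```

The schedule unmasks only a few positions in early steps and many in late ones, and the final step always leaves zero positions masked, so the output contains no `MASK` entries.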

Training #

Training Losses #

Genie is trained with multiple loss components:

Tokenizer Loss: VQ-VAE reconstruction loss for video tokenization
Latent Action Loss: Next-frame reconstruction loss plus VQ codebook and commitment losses for action learning
Dynamics Loss: Cross-entropy for token prediction with masking

Total Loss = L_tokenizer + λ₁·L_action + λ₂·L_dynamics

Data Format #

Dataset/
├── videos/
│   ├── video_001.mp4
│   ├── video_002.mp4
│   └── ...

Each video should contain at least num_frames frames.
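
The layout and the minimum-length rule can be checked with a short script. This is only a sketch: the document does not say how TorchWM reads videos, so `list_videos` and `sample_clip_start` are hypothetical helpers, and frame counts are passed in as plain integers instead of being read from the files.

```python
import random
import tempfile
from pathlib import Path
from typing import List


def list_videos(root: Path) -> List[Path]:
    """Return the .mp4 files under root/videos in a stable, sorted order."""
    return sorted((root / "videos").glob("*.mp4"))


def sample_clip_start(total_frames: int, num_frames: int, rng: random.Random) -> int:
    """Pick the first frame index of a num_frames-long clip inside one video."""
    if total_frames < num_frames:
        raise ValueError(f"video has {total_frames} frames, need at least {num_frames}")
    return rng.randint(0, total_frames - num_frames)


# Build a throwaway dataset with empty placeholder files to exercise the listing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "videos").mkdir()
    for name in ("video_002.mp4", "video_001.mp4", "notes.txt"):
        (root / "videos" / name).touch()
    names = [p.name for p in list_videos(root)]

start = sample_clip_start(total_frames=120, num_frames=8, rng=random.Random(0))
print(names, start)
```

A video shorter than `num_frames` raises an error here instead of being silently padded, which makes the requirement above explicit during data preparation.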

Key Hyperparameters #

| Parameter | Default | Description |
|---|---|---|
| `num_frames` | 8 | Number of frames per video |
| `image_size` | 32 | Input image size |
| `tokenizer_vocab_size` | 1024 | Video token vocabulary size |
| `action_vocab_size` | 8 | Latent action vocabulary size |
| `dynamics_dim` | 512 | Transformer hidden dimension |
| `dynamics_depth` | 8 | Number of transformer layers |
| `dynamics_num_heads` | 8 | Number of attention heads |
| `batch_size` | 4 | Training batch size |
| `learning_rate` | 3e-5 | Learning rate |
| `maskgit_steps` | 25 | Number of MaskGIT sampling steps |
| `warmup_steps` | 5000 | Learning rate warmup steps |
| `max_steps` | 125000 | Total training steps |

Usage in TorchWM #

Quick start #

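from torchwm import GenieConfig, create_genie_small

# Training configuration
cfg = GenieConfig()
cfg.num_frames = 16
cfg.image_size = 64
cfg.epochs = 100

# Small model variant; keep num_frames and image_size in sync with cfg
model = create_genie_small(num_frames=cfg.num_frames, image_size=cfg.image_size)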

Generation #

Generate new video frames from a prompt frame:

import torch

prompt_frame = torch.randn(1, 3, 64, 64)  # (batch, channels, height, width)
generated = model.generate(prompt_frame, num_frames=16)

Interactive Play #

Step through the environment using inferred or specified actions:

import torch

current_frame = torch.randn(1, 3, 64, 64)  # (batch, channels, height, width)
action = torch.tensor([3])  # Latent action index in [0, action_vocab_size)
next_frame = model.play(current_frame, action)

Action Inference #

Infer latent actions from real video frames:

import torch

frames = torch.randn(1, 3, 16, 64, 64)  # (batch, channels, frames, height, width)
actions = model.infer_actions(frames)

CLI #

torchwm train genie --config path/to/genie_config.yaml

Config Reference #

from torchwm import GenieConfig, create_genie_small

cfg = GenieConfig()
cfg.num_frames = 16
cfg.image_size = 64
cfg.epochs = 100

# Create model
model = create_genie_small(num_frames=16, image_size=64)

# Key hyperparameters
cfg.tokenizer_vocab_size = 1024   # Video token codebook
cfg.action_vocab_size = 8         # Latent action codebook
cfg.dynamics_dim = 512            # Transformer hidden size
cfg.dynamics_depth = 8            # Transformer layers
cfg.maskgit_steps = 25            # MaskGIT refinement steps

Model Variants #

| Variant | Params | Use Case |
|---|---|---|
| `create_genie_small` | ~50M | Development, debugging |
| `create_genie_large` | ~11B | Production, research |

Comparison: IRIS vs Genie #

| Aspect | IRIS | Genie |
|---|---|---|
| Actions | Provided by environment (known) | Inferred from video (latent) |
| Tokenizer | Per-frame VQ-VAE | Spatio-temporal VQ-VAE |
| Tokens per frame | 16 | 256 (typically) |
| Dynamics | Autoregressive (GPT) | Autoregressive + MaskGIT |
| Policy | Actor-critic (REINFORCE) | N/A (interactive generation) |
| Data requirement | ~100k env steps | ~50k+ videos |
| Use case | Model-based RL | Video world modeling |

Advantages #

Video-only training: No action labels required
Interactive: Can be used as a simulated environment
Generalizable: Learns from diverse video data
Latent action space: Enables efficient planning

Common Pitfalls #

Codebook collapse #

Most codebook entries go unused, so the effective vocabulary is much smaller than the configured size.

Fixes:

Use EMA codebook updates (default in Genie)
Lower commitment loss weight
Increase codebook dimension
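
Before applying any of these fixes, it helps to measure whether collapse is actually happening. The sketch below computes two common diagnostics from a batch of token indices: the fraction of codes used and the perplexity of the empirical code distribution. `codebook_stats` is a hypothetical helper, not part of TorchWM.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def codebook_stats(indices: Iterable[int], vocab_size: int) -> Tuple[float, float]:
    """Return (fraction of codes used, perplexity of the empirical code distribution)."""
    counts = Counter(indices)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return len(counts) / vocab_size, math.exp(entropy)


# Fully collapsed: every token maps to the same code.
collapsed = codebook_stats([7] * 100, vocab_size=1024)
# Ideal case: every code is used equally often.
healthy = codebook_stats(range(1024), vocab_size=1024)
print(collapsed, healthy)
```

Perplexity ranges from 1 (a single code in use) up to the vocabulary size (all codes used uniformly), so a value far below `tokenizer_vocab_size` is a sign of collapse.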

Transformer memory #

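With 256 tokens per frame and 16 frames, the sequence is 256 × 16 = 4096 tokens long, and the memory of full self-attention grows quadratically with that length.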

Fixes:

Use gradient checkpointing
Use sparse attention patterns
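
A back-of-the-envelope estimate shows why this matters. The sketch below counts only the attention score matrices of a single layer, assuming a naive implementation that materializes them in full and stores 4-byte (fp32) values; the head count and batch size are the defaults from the hyperparameter table, and `attention_score_bytes` is a hypothetical helper.

```python
def attention_score_bytes(tokens_per_frame: int, num_frames: int, num_heads: int,
                          batch_size: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to materialize one layer's full attention score matrices."""
    seq_len = tokens_per_frame * num_frames
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


per_layer = attention_score_bytes(256, 16, num_heads=8, batch_size=4)
print(per_layer / 2**30, "GiB per layer")
```

Under these assumptions the score matrices alone take 2 GiB per layer, before counting activations kept for the backward pass, which is why gradient checkpointing and sparse attention are the suggested remedies.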

Latent action disentanglement #

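The LAM might learn trivial actions, for example by assigning almost every transition to the same code, so the latent action carries little control signal.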

Fixes:

Increase action codebook size
Add entropy regularization on action distribution
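
One way to express the entropy regularizer is as a bonus on the entropy of the batch-averaged action distribution. The sketch below is an assumption about how such a term could look, not TorchWM's implementation: `mean_action_entropy` and `entropy_penalty` are hypothetical names, the inputs are per-sample probabilities over latent actions, and the weight of 0.01 is an arbitrary illustrative value.

```python
import math
from typing import List


def mean_action_entropy(action_probs: List[List[float]]) -> float:
    """Entropy (in nats) of the batch-averaged distribution over latent actions."""
    num_actions = len(action_probs[0])
    mean = [sum(row[a] for row in action_probs) / len(action_probs)
            for a in range(num_actions)]
    return -sum(p * math.log(p) for p in mean if p > 0.0)


def entropy_penalty(action_probs: List[List[float]], weight: float = 0.01) -> float:
    """Term to add to the loss; it is lower when the batch uses more actions."""
    return -weight * mean_action_entropy(action_probs)


# Every sample picks action 0: zero entropy, no reward from the regularizer.
collapsed_probs = [[1.0] + [0.0] * 7 for _ in range(8)]
# Each sample picks a different one of the 8 actions: maximal entropy log(8).
spread_probs = [[1.0 if a == i else 0.0 for a in range(8)] for i in range(8)]
print(mean_action_entropy(collapsed_probs), mean_action_entropy(spread_probs))
```

Averaging over the batch before taking the entropy is deliberate: each individual transition can still commit to one action, while the batch as a whole is encouraged to use the full action codebook.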

References #

Bruce, J., et al. (2024). Genie: Generative Interactive Environments. arXiv:2402.15391.
van den Oord, A., Vinyals, O., & Kavukcuoglu, K. (2017). Neural Discrete Representation Learning. NeurIPS 2017.
Chang, H., et al. (2022). MaskGIT: Masked Generative Image Transformer. CVPR 2022.
