Genie: Generative Interactive Environments#

Genie is a generative model trained on video-only data that can serve as an interactive environment for reinforcement learning and decision-making tasks.

Based on the paper Genie: Generative Interactive Environments (Bruce et al., 2024).

Key Idea#

Genie learns world dynamics from unlabeled videos through three components:

  1. Video Tokenization: Converts raw video frames into discrete tokens

  2. Latent Actions: Infers the underlying actions that caused transitions between frames

  3. Dynamics Prediction: Predicts future frames given past frames and latent actions

This enables agents to imagine and plan in a learned latent action space without needing explicit action labels.

Architecture#

Figure: Genie architecture.

  • Video frames → Video tokenizer → Video tokens → Dynamics model → Decoded frames

  • Consecutive frames → Latent action model → Latent actions → Dynamics model

Components#

1. Video Tokenizer#

Converts raw video frames into discrete tokens using a VQ-VAE approach:

  • Encoder processes frames into latent representations

  • Quantization layer maps latents to discrete codebook indices

  • Decoder reconstructs video from discrete tokens

Input: (B, C, T, H, W) → Tokens: (B, T, H/patch, W/patch)
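The shape mapping and the quantization step above can be illustrated with a small, dependency-free sketch. It is not TorchWM's implementation: the function names, the patch size of 4, and the toy two-dimensional codebook are assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
def token_grid_shape(batch, frames, height, width, patch):
    """Token tensor shape for a (B, C, T, H, W) video, assuming patch divides H and W."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    return (batch, frames, height // patch, width // patch)


def quantize(latent, codebook):
    """Index of the codebook vector closest to `latent` (squared L2 distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))


# Hypothetical 3-entry codebook; a real tokenizer would learn tokenizer_vocab_size entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

print(token_grid_shape(4, 16, 64, 64, patch=4))  # (4, 16, 16, 16)
print(quantize([0.9, 1.2], codebook))            # 1
```

Each spatial patch of each frame ends up as one integer index, which is what the dynamics model consumes.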

2. Latent Action Model (LAM)#

Learns to infer latent actions from video frame transitions:

  • Encodes pairs of consecutive frames

  • Predicts discrete latent action tokens

  • Uses VQ commitment loss for stable training

Input: (Frame_t, Frame_t+1) → Latent Action Index ∈ {0, ..., V-1}
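As a rough illustration of the two ideas above, the sketch below pairs consecutive frames (so T frames yield T-1 latent actions) and computes the value of a VQ commitment term. The helper names and the weight beta=0.25 are assumptions, not values taken from TorchWM; in a real framework the quantized vector would also be wrapped in a stop-gradient.

```python
def consecutive_pairs(frames):
    """Pair each frame with its successor: T frames give T-1 transitions."""
    return list(zip(frames[:-1], frames[1:]))


def commitment_loss(encoded, quantized, beta=0.25):
    """beta times the mean squared distance between an encoder output and its chosen code.

    Only the loss value is computed here; the quantized vector is treated as a constant.
    """
    squared = [(e - q) ** 2 for e, q in zip(encoded, quantized)]
    return beta * sum(squared) / len(squared)


print(consecutive_pairs(["f0", "f1", "f2", "f3"]))
# [('f0', 'f1'), ('f1', 'f2'), ('f2', 'f3')]
print(commitment_loss([1.0, 2.0], [0.0, 2.0]))  # 0.125
```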

3. Dynamics Model#

Transformer-based model that predicts future tokens:

  • Generates video tokens autoregressively over time, one frame at a time

  • Conditioned on latent actions

  • Uses MaskGIT-style iterative parallel decoding within each frame for efficient sampling
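To make the sampling cost concrete, here is a minimal sketch of how a MaskGIT-style decoder might spread token commitments across its steps. It assumes the cosine masking schedule commonly used with MaskGIT. Whether TorchWM uses exactly this schedule is not stated here, and the function name and the 256 tokens per frame are illustrative.

```python
import math


def maskgit_commit_schedule(num_tokens, steps):
    """Tokens newly committed at each step under a cosine masking schedule."""
    schedule = []
    masked = num_tokens  # every token starts masked
    for step in range(1, steps + 1):
        # Fraction of tokens that should remain masked after this step.
        ratio = math.cos(math.pi / 2 * step / steps)
        # Commit at least one token per step while any remain masked.
        still_masked = max(0, min(math.floor(num_tokens * ratio), masked - 1))
        schedule.append(masked - still_masked)
        masked = still_masked
    return schedule


schedule = maskgit_commit_schedule(num_tokens=256, steps=25)
print(schedule)
print(sum(schedule))  # 256: every token is committed by the last step
```

Few tokens are committed in the early steps, when the model has little context, and many in the later ones. A frame therefore takes `steps` forward passes instead of one pass per token.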

Training#

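from torchwm import GenieConfig, create_genie_small

# Training configuration; fields not set here keep their defaults.
cfg = GenieConfig()
cfg.num_frames = 16  # frames per training clip
cfg.image_size = 64  # input height and width
cfg.epochs = 100

# Build the small variant with the same clip length and resolution as the config.
model = create_genie_small(num_frames=cfg.num_frames, image_size=cfg.image_size)

# Use your training loop or the TorchWM training CLI for full Genie runs.
# trainer = GenieTrainer(model, cfg)
# trainer.train()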

Training Losses#

Genie is trained with multiple loss components:

  1. Tokenizer Loss: VQ-VAE reconstruction loss for video tokenization

  2. Latent Action Loss: VQ commitment loss + prediction loss for action learning

  3. Dynamics Loss: Cross-entropy for token prediction with masking

Total Loss = L_tokenizer + λ₁·L_action + λ₂·L_dynamics
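The weighted sum and the masked cross-entropy term can be sketched in plain Python as follows. The function names and the default weights of 1.0 are assumptions for illustration. The values of λ₁ and λ₂ used by TorchWM are not given here.

```python
import math


def masked_cross_entropy(probs, targets, mask):
    """Mean negative log-likelihood over masked positions only.

    probs[i] is a probability distribution over the token vocabulary at position i.
    """
    losses = [-math.log(p[t]) for p, t, m in zip(probs, targets, mask) if m]
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(l_tokenizer, l_action, l_dynamics, lambda_action=1.0, lambda_dynamics=1.0):
    """L_tokenizer + lambda_1 * L_action + lambda_2 * L_dynamics."""
    return l_tokenizer + lambda_action * l_action + lambda_dynamics * l_dynamics


# Two positions over a 2-token vocabulary; only the first position is masked.
l_dyn = masked_cross_entropy([[0.5, 0.5], [0.25, 0.75]], targets=[0, 1], mask=[True, False])
print(l_dyn)  # -log(0.5), about 0.693
print(total_loss(1.0, 2.0, 3.0, lambda_action=0.5, lambda_dynamics=2.0))  # 8.0
```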

Data Format#

Prepare videos as a dataset with the following structure:

Dataset/
├── videos/
│   ├── video_001.mp4
│   ├── video_002.mp4
│   └── ...

Each video should contain at least num_frames frames. The tokenizer will sample frames uniformly from each video during training.
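One way to read "sample frames uniformly" is as evenly spaced frame indices across the clip. The sketch below takes that interpretation. It is an assumption, since the loader might instead pick random positions, and both helper names are hypothetical.

```python
from pathlib import Path


def list_videos(root):
    """Sorted .mp4 files under <root>/videos, matching the layout shown above."""
    return sorted(Path(root, "videos").glob("*.mp4"))


def uniform_frame_indices(total_frames, num_frames):
    """num_frames evenly spaced indices covering the first to the last frame."""
    if total_frames < num_frames:
        raise ValueError("video is shorter than num_frames")
    if num_frames == 1:
        return [0]
    # Integer arithmetic keeps the endpoints exact and the indices distinct.
    return [i * (total_frames - 1) // (num_frames - 1) for i in range(num_frames)]


print(uniform_frame_indices(100, 4))  # [0, 33, 66, 99]
```

Videos with fewer than num_frames frames raise an error here. A real pipeline might skip or pad them instead.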

Key Hyperparameters#

| Parameter | Default | Description |
|---|---|---|
| num_frames | 8 | Number of frames per video |
| image_size | 32 | Input image size (height/width) |
| tokenizer_vocab_size | 1024 | Video token vocabulary size |
| action_vocab_size | 8 | Latent action vocabulary size |
| dynamics_dim | 512 | Transformer hidden dimension |
| dynamics_depth | 8 | Number of transformer layers |
| dynamics_num_heads | 8 | Number of attention heads |
| batch_size | 4 | Training batch size |
| learning_rate | 3e-5 | Learning rate |
| maskgit_steps | 25 | Number of MaskGIT sampling steps |
| warmup_steps | 5000 | Learning rate warmup steps |
| max_steps | 125000 | Total training steps |
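To show how learning_rate, warmup_steps, and max_steps could interact, here is a sketch of a linear warmup followed by cosine decay. The warmup shape and the cosine decay to zero are assumptions. The table gives only the three values, not the schedule TorchWM applies.

```python
import math


def learning_rate(step, base_lr=3e-5, warmup_steps=5000, max_steps=125000):
    """Linear warmup to base_lr, then cosine decay to zero at max_steps (assumed shape)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


for step in (0, 2500, 5000, 65000, 125000):
    print(step, learning_rate(step))
```

The rate rises from zero to 3e-5 over the first 5000 steps, passes through half of that at the midpoint of the decay (step 65000), and reaches zero at step 125000.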

Usage#

Generation#

Generate new video frames from a prompt frame:

import torch

prompt_frame = torch.randn(1, 3, 64, 64)  # (B, C, H, W); random stand-in for a real frame
generated = model.generate(prompt_frame, num_frames=16)  # `model` from the Training section

Interactive Play#

Step through the environment using inferred or specified actions:

import torch

current_frame = torch.randn(1, 3, 64, 64)  # (B, C, H, W)
action = torch.tensor([3])  # latent action index; must be below action_vocab_size
next_frame = model.play(current_frame, action)
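Repeated calls to play can be chained into a rollout, feeding each predicted frame back in as the next input. The sketch below assumes only that the model exposes play(frame, action) as in the example above. The rollout helper and the CountingModel stand-in are hypothetical and exist so the snippet runs without TorchWM.

```python
def rollout(model, first_frame, actions):
    """Apply each latent action in turn, chaining the predicted frame into the next step."""
    frames = [first_frame]
    for action in actions:
        frames.append(model.play(frames[-1], action))
    return frames


class CountingModel:
    """Hypothetical stand-in for a Genie model: the 'frame' is a number, play adds the action."""

    def play(self, frame, action):
        return frame + action


trajectory = rollout(CountingModel(), 0, [3, 1, 2])
print(trajectory)  # [0, 3, 4, 6]
```

With a real model, first_frame would be a (B, C, H, W) tensor and each action a latent action index tensor. Prediction errors can compound over long rollouts because each step conditions on a generated frame.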

Action Inference#

Infer latent actions from real video frames:

import torch

frames = torch.randn(1, 3, 16, 64, 64)  # (B, C, T, H, W)
actions = model.infer_actions(frames)
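A quick sanity check on inferred actions is to see how often each latent action index appears. If nearly all transitions map to one index, the action codebook may have collapsed. The helper below is a hypothetical sketch that expects a flat list of integer indices, which could be obtained from a tensor with `actions.flatten().tolist()`.

```python
from collections import Counter


def action_usage(action_indices, vocab_size=8):
    """Fraction of transitions assigned to each latent action index."""
    if not action_indices:
        return [0.0] * vocab_size
    counts = Counter(action_indices)
    total = len(action_indices)
    return [counts.get(i, 0) / total for i in range(vocab_size)]


print(action_usage([0, 0, 3, 7]))
# [0.5, 0.0, 0.0, 0.25, 0.0, 0.0, 0.0, 0.25]
```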

Model Variants#

| Model | Parameters | Use Case |
|---|---|---|
| create_genie_small | ~50M | Development/testing |
| create_genie_large | ~11B | Production/research |
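For a feel of where parameters go, the sketch below applies the common estimate of roughly 12·dim² weights per transformer block (four attention projections plus a two-layer MLP with 4x expansion). It covers only the dynamics transformer blocks and ignores biases, norms, embeddings, the tokenizer, and the latent action model, so it is not expected to reproduce the totals in the table. The function name and the MLP ratio of 4 are assumptions.

```python
def transformer_block_params(dim, depth, mlp_ratio=4):
    """Approximate weight count of `depth` standard transformer blocks."""
    attention = 4 * dim * dim        # query, key, value and output projections
    mlp = 2 * mlp_ratio * dim * dim  # two linear layers with hidden size mlp_ratio * dim
    return depth * (attention + mlp)


# Defaults from the hyperparameter table: dynamics_dim=512, dynamics_depth=8.
print(transformer_block_params(dim=512, depth=8))  # 25165824
```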

Comparison to Other Methods#

| Method | Input | Output | Use Case |
|---|---|---|---|
| JEPA | Images | Latent predictions | Representation learning |
| IRIS | Images | Token sequences | World modeling |
| Genie | Videos | Interactive env | RL agent training |

Advantages#

  1. Video-only training: No action labels required

  2. Interactive: Can be used as a simulated environment

  3. Generalizable: Learns from diverse video data

  4. Latent action space: Enables efficient planning

References#

  • Bruce, J., et al. (2024). Genie: Generative Interactive Environments.