Benchmarking World Models#
TorchWM includes a lightweight benchmark harness for running standardized evaluations of trained world-model agents and exporting results that can be used in experiment logs, reports, and papers.
Quick Overview#
Preferred CLI entrypoint: torchwm benchmark.
Benchmark adapters live in the TorchWM source tree under world_models/benchmarks/.
Supported adapters#
The benchmark CLI currently registers these adapters out of the box:
diamond - DIAMOND diffusion world-model agent
iris - IRIS transformer-based agent
dreamerv1 / dreamerv2 - Dreamer family
Benchmarks are intended for trained models. For single-agent runs, pass a
checkpoint with --checkpoint. For multi-agent runs, pass one or more
--checkpoint-map AGENT=PATH values, or use --train-epochs when you
intentionally want the CLI to train before evaluating.
TorchWM CLI examples#
Run IRIS on Pong for 3 episodes using seed 0 and write the standard result files
to results/bench:
torchwm benchmark \
--agent iris \
--game ALE/Pong-v5 \
--checkpoint checkpoints/iris/pong.pt \
--seeds 1 \
--episodes 3
Run DIAMOND on Breakout with two explicit seeds and 5 episodes per seed:
torchwm benchmark \
--agent diamond \
--game Breakout-v5 \
--checkpoint checkpoints/diamond/breakout.pt \
--seeds 0,1 \
--episodes 5 \
--out-dir results/diamond_breakout
Run DreamerV2 on a Gym environment:
torchwm benchmark \
--agent dreamerv2 \
--game Pong-v5 \
--checkpoint checkpoints/dreamerv2/pong.pt \
--seeds 1 \
--episodes 10 \
--device cpu
Run all registered adapters on the same game with per-agent checkpoints:
torchwm benchmark \
--all-agents \
--game ALE/Pong-v5 \
--checkpoint-map iris=checkpoints/iris/pong.pt \
--checkpoint-map diamond=checkpoints/diamond/pong.pt \
--checkpoint-map dreamerv1=checkpoints/dreamerv1/pong.pt \
--checkpoint-map dreamerv2=checkpoints/dreamerv2/pong.pt \
--seeds 0,1 \
--episodes 5 \
--out-dir results/pong_comparison
CLI options#
Common torchwm benchmark options:
--agent AGENT / -a AGENT: run one adapter (iris, diamond, dreamerv1, or dreamerv2).
--all-agents: run every registered adapter on the same environment.
--game GAME / -g GAME: Gym/ALE environment id, such as ALE/Pong-v5.
--checkpoint PATH / -c PATH: checkpoint path for single-agent benchmarks.
--checkpoint-map AGENT=PATH: repeatable per-agent checkpoint mapping for --all-agents.
--seeds SEEDS: either N for seeds 0..N-1, or a comma-separated list such as 0,1,2.
--episodes N / -n N: number of evaluation episodes per seed.
--out-dir DIR: output directory for report artifacts. The legacy alias --out_dir is also accepted.
--device DEVICE: device forwarded to adapters. Defaults to CUDA when available, otherwise CPU.
--preset PRESET: optional adapter/model preset.
--train-epochs N: for --all-agents, train first when checkpoint maps are not supplied.
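The two forms accepted by --seeds are easy to confuse: a bare number is a count, not a seed value. The sketch below illustrates that rule as described above. The function name parse_seeds is hypothetical and is not taken from the TorchWM source; the real CLI may parse the value differently.

```python
def parse_seeds(spec: str) -> list[int]:
    """Illustrative parser for a --seeds value (hypothetical helper).

    "3"     -> [0, 1, 2]   (a bare N means seeds 0..N-1)
    "0,1,2" -> [0, 1, 2]   (a comma-separated list is taken literally)
    """
    spec = spec.strip()
    if "," in spec:
        # Explicit list: keep the seeds exactly as given, skipping empty parts.
        return [int(part) for part in spec.split(",") if part.strip()]
    count = int(spec)
    if count < 1:
        raise ValueError("seed count must be at least 1")
    # Bare integer: interpret it as a count of seeds starting at 0.
    return list(range(count))


if __name__ == "__main__":
    print(parse_seeds("1"))    # a single run with seed 0
    print(parse_seeds("0,1"))  # two explicit seeds
```

Under this reading, --seeds 1 runs seed 0 only, which matches the IRIS example above.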
You can also run torchwm benchmark --help to see the installed CLI help.
Python usage#
For benchmark runs, prefer the main TorchWM CLI so commands are consistent with the rest of the package:
torchwm benchmark --agent iris --game ALE/Pong-v5 --checkpoint checkpoints/iris/pong.pt
The command writes benchmark JSON reports under the configured output directory. Load those reports with standard Python tools when you need custom analysis:
import json

# Load the report from the default output directory.
with open("results/bench/benchmark_results.json") as f:
    res = json.load(f)
per_seed = res["aggregate"]["per_seed_means"]  # per-seed mean returns
print(per_seed)
Running the Atari 100k benchmark#
To run the full Atari 100k benchmark on all configured games with the centralized benchmark module:
python -m world_models.benchmarks.atari_100k --benchmark
This runs the Atari 100k evaluator from world_models/benchmarks, computes
human-normalized scores, and reports aggregate metrics across games and seeds.
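For reference, the human-normalized score is conventionally defined as (agent - random) / (human - random), so 0 corresponds to a random policy and 1 to the human reference. The sketch below shows that arithmetic only. The function name and the reference values are assumptions made for illustration; the evaluator's own table in world_models/benchmarks may use different numbers.

```python
def human_normalized_score(agent: float, random_score: float, human_score: float) -> float:
    """Conventional human-normalized score: 0.0 = random policy, 1.0 = human reference."""
    span = human_score - random_score
    if span == 0:
        raise ValueError("human and random reference scores must differ")
    return (agent - random_score) / span


if __name__ == "__main__":
    # Commonly quoted Pong reference scores, used here purely as an illustration;
    # check them against the table the evaluator actually ships with.
    PONG_RANDOM, PONG_HUMAN = -20.7, 14.6
    print(human_normalized_score(5.0, PONG_RANDOM, PONG_HUMAN))
```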
Outputs#
The runner saves these files into the selected out_dir (default
results/bench):
benchmark_results.json - raw structured results.
benchmark_results.csv - per-seed rows.
benchmark_results.md - human-readable markdown table.
benchmark_results.tex - LaTeX table ready for papers.
Multi-agent runs also write combined reports such as
combined_benchmark_results.json and combined_benchmark_results.csv in the
root output directory, with per-agent details under subdirectories.
Computing IQM and bootstrap CIs#
The runner stores per-seed means in the JSON under
aggregate.per_seed_means. Use your preferred statistics package to compute
IQM and confidence intervals from that array.
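As one possible starting point, the sketch below computes the interquartile mean (the mean of the values left after dropping the lowest and highest 25%) and a percentile bootstrap confidence interval using only the standard library. The helper names iqm and bootstrap_ci are hypothetical, and the scores in the usage lines are made up. In practice you would pass the array loaded from aggregate.per_seed_means.

```python
import random


def iqm(values):
    """Interquartile mean: drop the lowest and highest 25% of values, average the rest."""
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one value")
    k = int(len(xs) * 0.25)  # number of values trimmed from each end
    middle = xs[k:len(xs) - k]
    return sum(middle) / len(middle)


def bootstrap_ci(values, stat=iqm, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement and take empirical quantiles."""
    rng = random.Random(seed)
    values = list(values)
    n = len(values)
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lo_idx], stats[hi_idx]


if __name__ == "__main__":
    # Made-up per-seed means, standing in for res["aggregate"]["per_seed_means"].
    per_seed_means = [12.0, 15.5, 9.0, 14.0, 13.5, 30.0, 11.0, 12.5]
    low, high = bootstrap_ci(per_seed_means)
    print(f"IQM = {iqm(per_seed_means):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With only a handful of seeds the interval will be wide and coarse. Treat it as a rough indication of spread rather than a precise estimate.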
Extending the harness#
Create an adapter in world_models/benchmarks/adapters.py that implements:
load_checkpoint(path: str)
evaluate(num_episodes: int, render: bool = False), returning {"episode_returns": list[float]}
Register your adapter in world_models/benchmarks/cli.py to expose it through torchwm benchmark.
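The adapter interface described above can be sketched as a small class. The class name RandomReturnAdapter is hypothetical and the random returns are placeholders: a real adapter would load model weights and roll out episodes in the environment. How adapters are registered in cli.py is not shown here because that mechanism is not documented in this page.

```python
import random


class RandomReturnAdapter:
    """Hypothetical adapter that satisfies the documented interface with fake returns."""

    def __init__(self, seed: int = 0):
        self._rng = random.Random(seed)
        self.checkpoint_path = None

    def load_checkpoint(self, path: str) -> None:
        # A real adapter would restore model weights here; this sketch only records the path.
        self.checkpoint_path = path

    def evaluate(self, num_episodes: int, render: bool = False) -> dict:
        # Placeholder returns; a real adapter would run the agent in the environment.
        returns = [self._rng.uniform(-21.0, 21.0) for _ in range(num_episodes)]
        return {"episode_returns": returns}


if __name__ == "__main__":
    adapter = RandomReturnAdapter()
    adapter.load_checkpoint("checkpoints/example.pt")  # illustrative path only
    print(adapter.evaluate(num_episodes=3))
```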
Tests and CI#
Place smoke tests under world_models/benchmarks/tests/ so CI can run them quickly.
The repo contains a mocking_classes.py helper for building fake agents and environments for fast unit tests.
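A smoke test can be as small as checking that an adapter honors the evaluate() return contract. The sketch below uses a self-contained FakeAdapter rather than the mocking_classes.py helper, whose API is not described here. Both FakeAdapter and check_adapter_contract are hypothetical names.

```python
from numbers import Real


class FakeAdapter:
    """Hypothetical stand-in agent that returns a fixed score for every episode."""

    def load_checkpoint(self, path: str) -> None:
        self.path = path

    def evaluate(self, num_episodes: int, render: bool = False) -> dict:
        return {"episode_returns": [0.0] * num_episodes}


def check_adapter_contract(adapter, num_episodes: int = 2) -> None:
    """Assert that the adapter returns one numeric return per requested episode."""
    adapter.load_checkpoint("unused.pt")
    result = adapter.evaluate(num_episodes=num_episodes)
    returns = result["episode_returns"]
    assert isinstance(returns, list)
    assert len(returns) == num_episodes
    assert all(isinstance(r, Real) for r in returns)


def test_fake_adapter_contract():
    # pytest collects functions named test_*; no environment or GPU is needed.
    check_adapter_contract(FakeAdapter())
```

Because the check takes any object with the two documented methods, the same function could be reused against a real adapter once a small checkpoint is available.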
Where to start#
Run the examples in examples/benchmark_iris.py or use torchwm benchmark directly.
If you need help wiring specific agent configs, use --device, --preset, and checkpoint options, or call the runner programmatically and pass extra_kwargs.