# Benchmarking World Models

TorchWM includes a lightweight benchmark harness for running standardized
evaluations of trained world-model agents and exporting results that can be used
in experiment logs, reports, and papers.

## Quick Overview

- Preferred CLI entrypoint: `torchwm benchmark`.
- Benchmark adapters live in the TorchWM source tree under `world_models/benchmarks/`.

## Supported adapters

The benchmark CLI currently registers these adapters out of the box:

- `diamond` - DIAMOND diffusion world-model agent
- `iris` - IRIS transformer-based agent
- `dreamerv1` / `dreamerv2` - Dreamer family

Benchmarks are intended for trained models. For single-agent runs, pass a
checkpoint with `--checkpoint`. For multi-agent runs, pass one or more
`--checkpoint-map AGENT=PATH` values, or use `--train-epochs` when you
intentionally want the CLI to train before evaluating.

## TorchWM CLI examples

Run IRIS on Pong for 3 episodes using seed 0 and write the standard result files
to `results/bench`:

```bash
torchwm benchmark \
  --agent iris \
  --game ALE/Pong-v5 \
  --checkpoint checkpoints/iris/pong.pt \
  --seeds 1 \
  --episodes 3
```

Run DIAMOND on Breakout with two explicit seeds and 5 episodes per seed:

```bash
torchwm benchmark \
  --agent diamond \
  --game Breakout-v5 \
  --checkpoint checkpoints/diamond/breakout.pt \
  --seeds 0,1 \
  --episodes 5 \
  --out-dir results/diamond_breakout
```

Run DreamerV2 on a Gym environment:

```bash
torchwm benchmark \
  --agent dreamerv2 \
  --game Pong-v5 \
  --checkpoint checkpoints/dreamerv2/pong.pt \
  --seeds 1 \
  --episodes 10 \
  --device cpu
```

Run all registered adapters on the same game with per-agent checkpoints:

```bash
torchwm benchmark \
  --all-agents \
  --game ALE/Pong-v5 \
  --checkpoint-map iris=checkpoints/iris/pong.pt \
  --checkpoint-map diamond=checkpoints/diamond/pong.pt \
  --checkpoint-map dreamerv1=checkpoints/dreamerv1/pong.pt \
  --checkpoint-map dreamerv2=checkpoints/dreamerv2/pong.pt \
  --seeds 0,1 \
  --episodes 5 \
  --out-dir results/pong_comparison
```

## CLI options

Common `torchwm benchmark` options:

- `--agent AGENT` / `-a AGENT`: run one adapter (`iris`, `diamond`, `dreamerv1`, or `dreamerv2`).
- `--all-agents`: run every registered adapter on the same environment.
- `--game GAME` / `-g GAME`: Gym/ALE environment id, such as `ALE/Pong-v5`.
- `--checkpoint PATH` / `-c PATH`: checkpoint path for single-agent benchmarks.
- `--checkpoint-map AGENT=PATH`: repeatable per-agent checkpoint mapping for `--all-agents`.
- `--seeds SEEDS`: either `N` for seeds `0..N-1`, or a comma-separated list such as `0,1,2`.
- `--episodes N` / `-n N`: number of evaluation episodes per seed.
- `--out-dir DIR`: output directory for report artifacts. The legacy alias `--out_dir` is also accepted.
- `--device DEVICE`: device forwarded to adapters. Defaults to CUDA when available, otherwise CPU.
- `--preset PRESET`: optional adapter/model preset.
- `--train-epochs N`: for `--all-agents`, train first when checkpoint maps are not supplied.

You can also run `torchwm benchmark --help` to see the installed CLI help.

## Python usage

For benchmark runs, prefer the main TorchWM CLI so commands are consistent with
the rest of the package:

```bash
torchwm benchmark --agent iris --game ALE/Pong-v5 --checkpoint checkpoints/iris/pong.pt
```

The command writes benchmark JSON reports under the configured output
directory. Load those reports with standard Python tools when you need custom
analysis:

```py
import json

res = json.load(open("results/bench/benchmark_results.json"))
per_seed = res["aggregate"]["per_seed_means"]
print(per_seed)
```

## Running the Atari 100k benchmark

To run the full Atari 100k benchmark on all configured games with the
centralized benchmark module:

```bash
python -m world_models.benchmarks.atari_100k --benchmark
```

This runs the Atari 100k evaluator from `world_models/benchmarks`, computes
human-normalized scores, and reports aggregate metrics across games and seeds.

## Outputs

The runner saves these files into the selected `out_dir` (default
`results/bench`):

- `benchmark_results.json` - raw structured results.
- `benchmark_results.csv` - per-seed rows.
- `benchmark_results.md` - human-readable markdown table.
- `benchmark_results.tex` - LaTeX table ready for papers.

Multi-agent runs also write combined reports such as
`combined_benchmark_results.json` and `combined_benchmark_results.csv` in the
root output directory, with per-agent details under subdirectories.

## Computing IQM and bootstrap CIs

The runner stores per-seed means in the JSON under
`aggregate.per_seed_means`. Use your preferred statistics package to compute
IQM and confidence intervals from that array.

## Extending the harness

- Create an adapter in `world_models/benchmarks/adapters.py` that implements:
  - `load_checkpoint(path: str)`
  - `evaluate(num_episodes: int, render: bool = False)` returning `{"episode_returns": list[float]}`
- Register your adapter in `world_models/benchmarks/cli.py` to expose it through `torchwm benchmark`.

## Tests and CI

- Place smoke tests under `world_models/benchmarks/tests/` so CI can run them quickly.
- The repo contains a `mocking_classes.py` helper for building fake agents and environments for fast unit tests.

## Where to start

- Run the examples in `examples/benchmark_iris.py` or use `torchwm benchmark` directly.
- If you need help wiring specific agent configs, use `--device`, `--preset`, and checkpoint options, or call the runner programmatically and pass `extra_kwargs`.