# Benchmarking World Models

This page documents the lightweight benchmarking harness included in the repository.

Quick Overview
--------------
- Code lives under `world_models/benchmarks/`.
- Entrypoints:
  - CLI: `python -m world_models.benchmarks.cli` (see examples below)
  - Python API: `world_models.benchmarks.runner.BenchmarkRunner`

Supported adapters (out of the box)
-----------------------------------
- `diamond` - DIAMOND diffusion world-model agent (`world_models.training.train_diamond.DiamondAgent`)
- `iris` - IRIS transformer-based agent (`world_models.training.train_iris.IRISTrainer`)
- `dreamerv1` / `dreamerv2` - Dreamer family (`world_models.models.dreamer.DreamerAgent`)

CLI Examples
------------
- Run IRIS on Pong for 3 episodes using seed 0 (writes results to `results/bench` by default):

```
python -m world_models.benchmarks.cli --agent iris --game ALE/Pong-v5 --seeds 0 --episodes 3
```

- Run DIAMOND on Breakout with two explicit seeds and 5 episodes per seed:

```
python -m world_models.benchmarks.cli --agent diamond --game Breakout-v5 --seeds 0,1 --episodes 5
```

- Run DreamerV2 on a Gym env (example):

```
python -m world_models.benchmarks.cli --agent dreamerv2 --game Pong-v5 --seeds 1 --episodes 10
```

- Run all agents on Pong for 3 episodes using seed 0:

```
python -m world_models.benchmarks.cli --all-agents --game ALE/Pong-v5 --seeds 0 --episodes 3
```

Python API Example
------------------
Use the `BenchmarkRunner` when you need programmatic control:

```py
from world_models.benchmarks.runner import BenchmarkRunner
from world_models.benchmarks import adapters

runner = BenchmarkRunner(adapter_cls=adapters.IRISAdapter, out_dir="results/bench")
res = runner.run(env_spec={"game": "ALE/Pong-v5"}, seeds=[0, 1], num_episodes=5)
print(res)
```

Running the Atari 100k Benchmark
--------------------------------
To run the full Atari 100k benchmark on all 26 games using IRIS:

```
python benchmarks/atari_100k.py
```

This trains IRIS on each game for 100k environment steps with 5 random seeds per game, computes human-normalized scores, and compares the results to baselines.
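For reference, the human-normalized score follows the standard Atari 100k convention. A minimal sketch; the Pong reference values below are the ones commonly cited in the Atari 100k literature, not values shipped with this repo:

```py
def human_normalized_score(agent_score: float, random_score: float, human_score: float) -> float:
    """Standard Atari 100k normalization: 0.0 matches a random policy, 1.0 matches the human reference."""
    return (agent_score - random_score) / (human_score - random_score)


# Illustrative reference values for Pong (random -20.7, human 14.6):
print(human_normalized_score(agent_score=10.0, random_score=-20.7, human_score=14.6))  # ~0.87
```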
Outputs
-------
- The runner saves results into the `out_dir` (default `results/bench`):
  - `benchmark_results.json` (raw structured results)
  - `benchmark_results.csv` (one row per seed)
  - `benchmark_results.md` (human-readable Markdown table)
  - `benchmark_results.tex` (LaTeX table ready for papers)

Computing IQM and bootstrap CIs
-------------------------------
The runner stores per-seed means in the JSON under `aggregate.per_seed_means`. Use the provided metrics helpers to compute the interquartile mean (IQM) and bootstrap confidence intervals:

```py
from world_models.benchmarks import metrics, reporting
import json

with open("results/bench/benchmark_results.json") as f:
    res = json.load(f)
per_seed = res["aggregate"]["per_seed_means"]

iqm = metrics.iqm_of_array(per_seed)
lower, upper = metrics.bootstrap_iqm_ci(per_seed, num_samples=2000, alpha=0.05)
print(f"IQM={iqm:.3f} (95% CI {lower:.3f} - {upper:.3f})")

# export a LaTeX table from the results
reporting.export_latex(res, "results/bench/benchmark_results.tex")
```

Extending the harness
---------------------
- Create an adapter in `world_models/benchmarks/adapters.py` that implements:
  - `load_checkpoint(path: str)` and
  - `evaluate(num_episodes: int, render: bool = False)` returning `{"episode_returns": List[float]}`.
- Register your adapter in `world_models/benchmarks/cli.py` to expose it via the CLI.
- A minimal adapter sketch appears at the end of this page.

Tests and CI
------------
- Place smoke tests under `world_models/benchmarks/tests/` so CI can run them quickly.
- The repo contains a `mocking_classes.py` helper for building fake agents/environments for fast unit tests (a test sketch appears at the end of this page).

Where to start
--------------
- Run the examples in `examples/benchmark_iris.py` or use the CLI directly.
- If you need help wiring specific agent configs (device, preset, checkpoint paths), use the CLI `--device` and `--preset` flags or call the runner programmatically and pass `extra_kwargs` (see the sketch just below).
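A sketch of passing agent-specific configuration programmatically. Exactly where `extra_kwargs` is accepted, and which keys each adapter understands, are assumptions here; the keys below merely mirror the CLI's `--device` and `--preset` flags, so check the runner's signature in `world_models/benchmarks/runner.py` before relying on this:

```py
from world_models.benchmarks.runner import BenchmarkRunner
from world_models.benchmarks import adapters

runner = BenchmarkRunner(adapter_cls=adapters.IRISAdapter, out_dir="results/bench")
# Hypothetical keys mirroring the CLI flags; accepted keys depend on the adapter.
res = runner.run(
    env_spec={"game": "ALE/Pong-v5"},
    seeds=[0],
    num_episodes=3,
    extra_kwargs={"device": "cuda:0", "preset": "default"},
)
```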
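The adapter sketch referenced under "Extending the harness". It implements the two documented methods; whether adapters must subclass a common base class (and the exact constructor signature) is not specified here, so check the existing adapters in `world_models/benchmarks/adapters.py` for the authoritative pattern:

```py
from typing import Dict, List
import random


class RandomAgentAdapter:
    """Hypothetical adapter wrapping a random agent, for illustration only."""

    def load_checkpoint(self, path: str) -> None:
        # A real adapter would restore model weights here; a random agent
        # has no state to load, so we just remember the path.
        self.checkpoint_path = path

    def evaluate(self, num_episodes: int, render: bool = False) -> Dict[str, List[float]]:
        # A real adapter would roll the agent out in its environment and
        # record true episode returns; placeholder values illustrate the
        # required return shape.
        returns = [random.gauss(0.0, 1.0) for _ in range(num_episodes)]
        return {"episode_returns": returns}
```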
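And the test sketch referenced under "Tests and CI". It checks the adapter contract with an inline fake; in the real repo you would likely build the fake from `mocking_classes.py` instead, but its exact helpers are not shown on this page:

```py
# world_models/benchmarks/tests/test_smoke.py (illustrative path)


class _FakeAdapter:
    """Tiny stand-in satisfying the documented adapter contract."""

    def load_checkpoint(self, path: str) -> None:
        self.checkpoint_path = path

    def evaluate(self, num_episodes: int, render: bool = False):
        return {"episode_returns": [0.0] * num_episodes}


def test_adapter_contract():
    adapter = _FakeAdapter()
    adapter.load_checkpoint("dummy.ckpt")
    result = adapter.evaluate(num_episodes=3)
    assert list(result) == ["episode_returns"]
    assert len(result["episode_returns"]) == 3
```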