# DiT: Diffusion Transformer DiT (Diffusion Transformer) applies the transformer architecture to diffusion models for image generation. It treats diffusion as a sequence modeling problem where the transformer predicts noise at each timestep. Based on paper: [Scalable Diffusion Models with Transformers](https://arxiv.org/abs/2212.09748) (Peebles & Xie, 2023) ## Key Idea Instead of using CNNs (like U-Net) for diffusion, DiT uses a Vision Transformer (ViT) architecture: 1. **Tokenization**: Split image into patches, convert to tokens 2. **Diffusion as sequence**: Treat noise prediction as sequence modeling 3. **Transformer**: Process token sequence to predict added noise 4. **DDPM**: Generate images by iteratively denoising ## Architecture

DiT Architecture

*Figure 1: DiT architecture overview from the DiT paper (Peebles & Xie, 2023). Shows the transformer-based diffusion model with patch embedding, timestep conditioning, and noise prediction.*

Diffusion Transformer

Noisy image input Patch and position embeddings Timestep and condition tokens DiT transformer blocks Predicted noise output
## Components ### 1. Patch Embedding Converts input image into a sequence of tokens: - Image of size `(C, H, W)` split into `(H/P × W/P)` patches - Each patch linearly embedded to dimension `D` - Add positional embeddings ```{math} \begin{aligned} \mathbf{x} &\in \mathbb{R}^{C \times H \times W} \\ &\rightarrow \mathbf{tokens} \in \mathbb{R}^{(H/P \times W/P) \times D} \end{aligned} ``` ### 2. Timestep & Condition Embeddings - **Timestep**: sinusoial positional encoding for diffusion timestep `t` - **Class label**: Optional conditioning for class-conditional generation - Both added to each token ### 3. DiT Blocks Transformer blocks with: - **Self-attention**: Capture relationships between patches - **MLP**: Process each patch independently - **Adaptive Layer Norm (AdaLN)**: Condition on timestep/class Variants: - **AdaLN**: Adaptive Layer Norm - **AdaLN-Single**: More parameter-efficient - **AdaLN-Diagonal**: Condition variance too ### 4. Output Head Final layer predicting noise (ε) same dimension as input: ```{math} \hat{\epsilon} = \mathrm{Linear}(\mathbf{tokens}) \rightarrow \mathbb{R}^{C \times H \times W} ``` ## Training ```python :class: thebe from torchwm import DiTConfig, get_dit_config cfg = get_dit_config( DATASET="CIFAR10", BATCH=128, EPOCHS=100, IMG_SIZE=32, WIDTH=384, DEPTH=6, ) # Training uses DDPM noise scheduling ``` ```{math} q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}\right) ``` ```{math} p(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\left(\mu_\theta(\mathbf{x}_t, t), \sigma_t^2\mathbf{I}\right) ``` ### Key Hyperparameters | Parameter | Default | Description | |-----------|---------|-------------| | `IMG_SIZE` | 32 | Input image size | | `PATCH` | 4 | Patch size | | `WIDTH` | 384 | Transformer embedding dim | | `DEPTH` | 6 | Number of blocks | | `HEADS` | 6 | Attention heads | | `TIMESTEPS` | 1000 | Diffusion steps | | `BETA_START` | 1e-4 | Noise schedule start | | `BETA_END` | 0.02 | Noise schedule end | ## Sampling (Generation) ```python :class: thebe # Start from random noise # Iteratively denoise for t in reversed(range(T)): ε = model(x_t, t) # Predict noise x_{t-1} = x_t - sqrt(1-alpha_bar_t) * ε # Remove predicted noise # Output: x_0 (generated image) ``` ```{math} \mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I}) ``` ```{math} \epsilon = \mathrm{model}(\mathbf{x}_t, t) ``` ```{math} \mathbf{x}_{t-1} = \mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon ``` ### Classifier-Free Guidance For conditional generation, use classifier-free guidance: ```{math} \begin{aligned} \epsilon_\mathrm{cond} &= (1 + w) \cdot \epsilon_\mathrm{model}(\mathbf{x}_t, c) \\ &\quad - w \cdot \epsilon_\mathrm{model}(\mathbf{x}_t, \emptyset) \end{aligned} ``` where `w` is guidance weight (typically 1-10). ## Comparison to CNN-Based Diffusion | Aspect | U-Net (DDPM) | DiT (Transformer) | |--------|-------------|-------------------| | Architecture | CNN with skip connections | ViT | | Global attention | Limited | Full | | Scalability | Medium | High | | Quality | Good | Slightly better | | Compute | Efficient | Higher | ## Applications 1. **Image generation**: High-quality unconditional/class-conditional 2. **Image editing**: Inpainting, outpainting 3. **Video generation**: Temporal extension 4. **World models**: For model-based RL (as observation model) ## References - Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers.