Diffusion Policy

Article Goal

Explain the Diffusion Policy algorithm for visuomotor policy learning in robotics

What is Diffusion Policy?

Diffusion Policy is an approach to robot behavior generation that represents policies as conditional denoising diffusion processes. Instead of directly predicting actions from observations, Diffusion Policy learns to gradually denoise random noise into coherent action sequences, conditioned on visual observations and robot states.

The key insight is that action generation can be framed as a generative modeling problem: the policy learns the distribution of expert actions and samples from this learned distribution during execution. This approach naturally handles multimodal behaviors and avoids the mode averaging that plagues traditional behavior cloning methods with unimodal outputs.

Diffusion Policy: Written Walkthrough

Core Concepts

  1. Action Diffusion - Models the policy as a conditional diffusion process that generates action sequences by iteratively denoising samples of Gaussian noise.
  2. Multimodal Behavior Learning - Captures multiple valid ways to perform a task without averaging them out.
  3. Receding Horizon Control - Predicts sequences of future actions but only executes the first few, then re-plans.
  4. Visual Conditioning - Conditions the diffusion process on camera observations and proprioceptive state.

The Math

Forward Diffusion Process

The forward process gradually adds noise to action sequences: \(q(A_k | A_{k-1}) = \mathcal{N}(A_k; \sqrt{1-\beta_k}A_{k-1}, \beta_k I)\)

Where:

  • \(A_0\) is the original action sequence
  • \(A_k\) is the noisy version at diffusion step \(k\)
  • \(\beta_k\) is the noise variance added at step \(k\), taken from a predefined noise schedule
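
Because each step adds independent Gaussian noise, the per-step transitions compose into the closed form \(A_k = \sqrt{\bar{\alpha}_k}\,A_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon\) with \(\alpha_k = 1-\beta_k\) and \(\bar{\alpha}_k = \prod_{i=1}^{k}\alpha_i\), which is what training code typically uses. Below is a minimal PyTorch sketch of this forward noising; the linear schedule, step count, and tensor shapes are illustrative assumptions, not details of the original implementation.

```python
import torch

def make_noise_schedule(num_steps: int = 100, beta_start: float = 1e-4, beta_end: float = 2e-2):
    """Linear beta schedule (an illustrative choice) plus the derived alpha terms."""
    betas = torch.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_k = prod_i alpha_i
    return betas, alphas, alpha_bars

def forward_noise(actions: torch.Tensor, k: torch.Tensor, alpha_bars: torch.Tensor):
    """Sample A_k ~ q(A_k | A_0) in closed form.

    actions:    (B, T_a, action_dim) clean action sequences A_0
    k:          (B,) integer diffusion step per sample
    alpha_bars: (K,) cumulative products of alpha
    """
    eps = torch.randn_like(actions)             # the noise the network will learn to predict
    a_bar = alpha_bars[k].view(-1, 1, 1)        # broadcast over time and action dimensions
    noisy = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * eps
    return noisy, eps
```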

Reverse Diffusion Process

The reverse process learns to denoise: \(p_\theta(A_{k-1} | A_k, O) = \mathcal{N}(A_{k-1}; \mu_\theta(A_k, O, k), \Sigma_\theta(A_k, O, k))\)

Where:

  • \(O\) represents observations (visual + proprioceptive)
  • \(\theta\) are the neural network parameters

Noise Prediction

The neural network learns to predict the noise that was added: \(\epsilon_\theta(A_k, O, k) \approx \epsilon\)

Where \(\epsilon\) is the noise that was added to create \(A_k\) from \(A_0\).
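
In the original work, \(\epsilon_\theta\) is implemented as a 1D temporal U-Net or a transformer over the action sequence, conditioned on observation features. The sketch below replaces that with a small MLP purely to make the input/output contract concrete; the layer sizes, step-index embedding, and concatenation-based conditioning are illustrative assumptions, not the original architecture.

```python
import torch
import torch.nn as nn

class NoisePredictionMLP(nn.Module):
    """Stand-in for epsilon_theta(A_k, O, k): predicts the noise added to a noisy
    action sequence, conditioned on an observation feature vector and the step k."""

    def __init__(self, action_horizon: int, action_dim: int, obs_dim: int,
                 num_steps: int = 100, hidden: int = 256):
        super().__init__()
        self.action_horizon = action_horizon
        self.action_dim = action_dim
        self.step_embed = nn.Embedding(num_steps, hidden)     # embeds the integer step k
        in_dim = action_horizon * action_dim + obs_dim + hidden
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_horizon * action_dim),
        )

    def forward(self, noisy_actions: torch.Tensor, obs: torch.Tensor, k: torch.Tensor):
        # noisy_actions: (B, T_a, action_dim), obs: (B, obs_dim), k: (B,) long
        x = torch.cat([noisy_actions.flatten(1), obs, self.step_embed(k)], dim=-1)
        return self.net(x).view(-1, self.action_horizon, self.action_dim)
```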

Training Loss

The training objective is: \(\mathcal{L} = \mathbb{E}_{A_0, \epsilon, k, O} \left[ \|\epsilon - \epsilon_\theta(A_k, O, k)\|^2 \right]\)
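
A minimal training step under the same assumptions: draw a random step \(k\), noise the demonstration actions with the closed form from above, and regress the predicted noise onto the true noise. The observation encoder is treated as a black box that already returns a feature vector.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, actions, obs_features, alpha_bars):
    """One gradient step of L = E[ || eps - eps_theta(A_k, O, k) ||^2 ].

    actions:      (B, T_a, action_dim) expert action sequences A_0
    obs_features: (B, obs_dim) encoded visual + proprioceptive observations
    alpha_bars:   (K,) cumulative alpha products from the noise schedule
    """
    num_steps = alpha_bars.shape[0]
    k = torch.randint(0, num_steps, (actions.shape[0],), device=actions.device)

    # Forward-noise A_0 to A_k in closed form (same as the forward_noise sketch above).
    eps = torch.randn_like(actions)
    a_bar = alpha_bars[k].view(-1, 1, 1)
    noisy_actions = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * eps

    loss = F.mse_loss(model(noisy_actions, obs_features, k), eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```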

Sampling Process

During inference, actions are generated by:

  1. Start with random noise: \(A_K \sim \mathcal{N}(0, I)\)
  2. Iteratively denoise: \(A_{k-1} = \frac{1}{\sqrt{\alpha_k}}(A_k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\epsilon_\theta(A_k, O, k)) + \sigma_k z\)
  3. Output final action sequence: \(A_0\)

Where \(\alpha_k = 1 - \beta_k\), \(\bar{\alpha}_k = \prod_{i=1}^k \alpha_i\), and \(z \sim \mathcal{N}(0, I)\) (with \(z = 0\) at the final step \(k = 1\)).
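
The sketch below implements this sampling loop directly, using \(\sigma_k^2 = \beta_k\) (a common DDPM choice; the scheduler settings used in practice may differ) and adding no noise at the final step.

```python
import torch

@torch.no_grad()
def sample_actions(model, obs_features, betas, action_horizon: int, action_dim: int):
    """Generate an action sequence A_0 by iterative denoising, conditioned on obs_features."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    batch = obs_features.shape[0]

    # Start from pure Gaussian noise: A_K ~ N(0, I)
    A = torch.randn(batch, action_horizon, action_dim, device=obs_features.device)
    for k in reversed(range(betas.shape[0])):
        k_batch = torch.full((batch,), k, dtype=torch.long, device=A.device)
        eps_pred = model(A, obs_features, k_batch)
        mean = (A - betas[k] / torch.sqrt(1.0 - alpha_bars[k]) * eps_pred) / torch.sqrt(alphas[k])
        if k > 0:
            A = mean + torch.sqrt(betas[k]) * torch.randn_like(A)  # sigma_k^2 = beta_k
        else:
            A = mean                                               # no noise at the final step
    return A
```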

Action Chunking

The policy predicts action sequences of length \(T_a\): \(A = [a_t, a_{t+1}, ..., a_{t+T_a-1}]\)

But only executes the first \(T_e\) actions before replanning.
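
A sketch of the resulting receding-horizon loop, reusing the `sample_actions` sketch above; `env` and `encode_obs` are hypothetical stand-ins for the real robot interface and observation encoder, and the horizon lengths are illustrative.

```python
def run_policy(env, encode_obs, model, betas, action_dim,
               action_horizon=16, exec_horizon=8, max_replans=200):
    """Receding-horizon execution: predict T_a actions, execute the first T_e, then replan.

    env and encode_obs are hypothetical interfaces: env.reset()/env.step(a) stand in for
    the robot (or simulator), and encode_obs maps raw observations to a (1, obs_dim) tensor.
    """
    obs = env.reset()
    for _ in range(max_replans):
        obs_features = encode_obs(obs)                           # visual + proprioceptive features
        plan = sample_actions(model, obs_features, betas,
                              action_horizon, action_dim)[0]     # (T_a, action_dim)
        for action in plan[:exec_horizon]:                       # execute only the first T_e actions
            obs, done = env.step(action)
            if done:
                return
```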

Algorithm Steps

  1. Data Collection - Gather expert demonstrations with observations and action sequences
  2. Noise Schedule Setup - Define diffusion timesteps and noise schedule \(\beta_1, ..., \beta_K\)
  3. Neural Network Training - Train noise prediction network \(\epsilon_\theta\) on demonstration data
  4. Inference Initialization - Start with random noise \(A_K \sim \mathcal{N}(0, I)\)
  5. Denoising Loop - For \(k = K, K-1, ..., 1\): predict noise and update \(A_{k-1}\)
  6. Action Execution - Execute first \(T_e\) actions from denoised sequence \(A_0\)
  7. Replanning - Observe new state and repeat sampling process
  8. Repeat - Continue until task completion

Advantages of Diffusion Policy

  1. Multimodal Behavior - Naturally learns and executes diverse behavioral modes without mode collapse
  2. High-Quality Trajectories - Generates smooth, high-quality action sequences
  3. Visual Robustness - Handles complex visual observations effectively
  4. Stable Training - The noise-prediction regression objective trains more stably than adversarial (GAN-based) alternatives and avoids their balancing issues
  5. Expressiveness - Can represent complex, multimodal action distributions
  6. Strong Empirical Performance - The original paper reports an average improvement of 46.9% over prior state-of-the-art methods across its benchmark suite

Limitations

  1. Computational Cost - Requires multiple denoising steps at inference time, making it slower than direct action prediction
  2. Hyperparameter Sensitivity - Performance depends on noise schedule, diffusion steps, and architecture choices
  3. Training Time - Requires more training time than simple behavioral cloning approaches
  4. Memory Requirements - Needs to store and process action sequences rather than single actions
  5. Limited Real-Time Applicability - Iterative sampling latency can be a bottleneck for tasks that demand high-frequency control
  6. Architecture Complexity - More complex to implement than standard imitation learning approaches