VIBE RESEARCH — VOLUME I

On-Policy Distillation
Adventures

§ 0

Vibe Research

Thoughts on recent research blogging trends.

Recently, blogging has become an easier medium for distributing research [1, 2, 3], and many such posts have gained wide popularity and impact in the AI research community. Beyond ease of distribution, blogging is more accessible and less demanding: researchers are not required to present complete results and can share work at any desired level of completeness, in a less stressful and more flexible format structured freely according to their preferences. This flexibility incentivizes the sharing of knowledge and results, including negative findings that might otherwise remain unpublished.

With the proliferation of LLM assistants, it has become far easier to share research in rich media formats that go beyond traditional conference papers. We can now build animations and interactive visualizations in the spirit of distill.pub, coded in one shot with tools like Claude Code or Codex [4, 5], which would be impractical or impossible in conventional paper formats. Researchers can use these coding tools to speed up their workflow, as much of the implementation can be written or assisted by LLMs.

§ 1

Background

On-policy distillation (OPD) has gained renewed attention following the Thinking Machines Lab blog by Kevin Lu et al.1 OPD avoids the distribution shift problem because learning samples are generated directly from the student policy, ensuring that the training distribution matches the distribution encountered at inference time6. This technique allows researchers to first train a larger frontier model and then reduce inference costs by robustly distilling most of its capabilities into a smaller model. In this blog, we present several empirical findings on OPD: (1) how to stabilize the naive implementation, (2) which hyperparameters most strongly affect performance, and (3) modifications to the vanilla setup, such as alternative divergences, combining multiple teacher models, and selective masking strategies that influence evaluation performance.

Supervised fine-tuning (SFT) in LLMs can be viewed as a special case of knowledge distillation, where the supervision corresponds to a one-hot target distribution over tokens, with a single reference response from the teacher \( {\textcolor{red}{\pi_T}} \) for each input.

\[ \mathcal{L}_{SFT}(\theta) = \mathbb{E}_{x\sim D} \mathbb{E}_{y\sim {\textcolor{red}{\pi_T}}(\cdot|x)} \left[ \frac{1}{L_y} \sum_{t=1}^{L_y} -\log {\pi_\theta}(y_t|y_{\lt t},x) \right] \]
(1)
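Concretely, Eq. (1) reduces to a length-normalized negative log-likelihood. Below is a minimal numpy sketch, assuming the per-token student log-probs have already been computed by some external forward pass; the `sft_loss` helper and its input format are illustrative, not from the post.

```python
import numpy as np

def sft_loss(token_logprobs):
    """Length-normalized NLL of Eq. (1).

    token_logprobs: list of 1-D arrays; entry i holds the student's
    log pi_theta(y_t | y_<t, x) for each token of the i-th
    teacher-written response (hypothetical pre-computed inputs).
    """
    # Inner average: (1/L_y) * sum_t -log pi_theta, per response.
    # Outer average: the expectations over prompts and responses.
    return float(np.mean([-lp.mean() for lp in token_logprobs]))
```

For example, a single two-token response with probabilities 0.5 and 0.25 gives (ln 2 + ln 4) / 2 ≈ 1.04.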

On-policy distillation was formalized in the generalized knowledge distillation framework6, where the expectation is taken over the student's \( {\textcolor{green}{\pi_\theta}} \) rollout rather than the teacher's:

\[ \mathcal{L}_{OPD}(\theta) = \mathbb{E}_{x\sim D} \mathbb{E}_{y\sim {\textcolor{green}{\pi_\theta}}(\cdot|x)} \left[ \frac{1}{L_y} \sum_{t=1}^{L_y} D_{KL} \left( {{\pi_\theta}}(\cdot|y_{\lt t},x) \parallel {{\pi_T}}(\cdot|y_{\lt t},x) \right) \right] \]
(2)

In contrast to SFT, where the expectation is taken under the teacher policy \( {\textcolor{red}{\pi_T}} \), on-policy distillation samples trajectories from the learner itself \( {\textcolor{green}{\pi_\theta}} \) and performs policy-gradient updates in which the advantage is a token-level KL estimate1. Training on the learner's own on-policy samples has been shown to reduce forgetting and improve generalization78. Intuitively, training on-policy aligns the learning distribution with the evaluation distribution, substantially reducing distributional shift.

Beyond the distillation framing, we can also compare on-policy distillation to reinforcement learning from verifiable rewards (RLVR). On-policy distillation provides dense, token-level advantages derived from the teacher signal, enabling fine-grained credit assignment. In contrast, common RLVR methods such as GRPO10 typically operate at the response level, with a single scalar advantage for the whole sequence, limiting the granularity of credit assignment across the tokens of a response.

§ 2

Naive Implementation

Matching the complete KL divergence for LLMs requires summing over the entire vocabulary at every token position, which is computationally expensive when the vocabulary is large. Instead, previous works estimate the KL from the single token actually sampled at each position24. This reduces the problem to a token-level RL-like objective, where the advantage is the per-token log-probability difference between teacher and student:

$$\begin{aligned} A_t &= \log \pi_T(y_t\mid y_{\lt t},x)-\log \pi_\theta(y_t\mid y_{\lt t},x), \\ \nabla_\theta J(\theta) &= \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)} \!\left[\sum_t A_t\,\nabla_\theta \log\pi_\theta(y_t\mid y_{\lt t},x)\right]. \end{aligned}$$
(3)

Here $A_t$ is the per-token advantage: it measures how much more likely the teacher considers token $y_t$ compared to the student, serving as a dense signal that drives the student toward the teacher's distribution1. We keep all updates strictly on-policy to keep this estimator unbiased.

Throughout the rest of this post, we use the token-level ratio \( r_t := \frac{\pi_\theta(y_t \mid y_{< t}, x)}{\pi_{\rm T}(y_t \mid y_{< t}, x)} \). Under this convention, the reverse-KL reward used in Algorithm 1 is \( A_t = \log \pi_{\rm T}(y_t \mid y_{< t}, x) - \log \pi_\theta(y_t \mid y_{< t}, x) = -\log r_t \). We call \(A_t\) either the advantage or the policy-gradient weight in the remainder of this post.

Naive Pseudo Code

We implement the naive baseline using the k1 estimator16, the single-sample estimator of reverse KL, as the per-token advantage. The pseudo code is shown in Algorithm 1.


for batch in train_loader:
    y, logp = student.rollout(batch)
    logp_t = teacher.log_prob(batch, y)

    A = logp_t - logp                   # per-token k1
    loss = -(A.detach() * logp).mean()  # standard token-level REINFORCE

    student.update(loss)
Algorithm 1. Naive on-policy distillation with k1 advantage and standard REINFORCE.

Experiment Settings

For this naive baseline, we follow Algorithm 1 directly. Unless otherwise stated, all runs use Qwen/Qwen3-4B-Base as the student and Qwen/Qwen3-4B as the teacher, and we train for one epoch on the DeepMath-103K prompts. For evaluation, we use both mean@8 and best@8 as metrics, averaged over: BeyondAIME, AIME25, AIME24, HMMT25, BRUMO25. All other hyperparameters use the defaults listed in the Hyperparameter Analysis section. We implemented our experiments using the veRL framework.

Naive Results

We plot the Train KL results, which in the case of OPD correspond to the training cost/negative reward (left), and the aggregate eval performance (right) below. We can immediately see that there is a stability issue with the naive implementation: the KL estimator drops sharply after around 120 training steps (even going negative), which is also directly accompanied by a collapse in eval performance.


Figure 1. Naive k1 training dynamics. Left: train KL estimator over steps (negative values signal estimator breakdown). Right: aggregate eval performance (mean@8) over training.
§ 3

Stabilizing Experiments

Given the instability observed in the previous section, we first establish a stable baseline before comparing different design choices. Without this, apparent improvements could simply reflect training instability rather than genuine gains from the proposed method or settings.

The Instability Problem

We investigated two angles to try to stabilize the runs:

  • Training-inference mismatch correction313: the rollout engine and learner backend can assign different token probabilities even when they share the same parameters (due to differences in CUDA kernels, batching strategies, reduce ordering, etc.), making the updates effectively off-policy. We test two importance-sampling correction techniques to mitigate this mismatch:
    • Truncated-IS (TIS): clamp the per-token importance ratio of rollout and learner to a fixed range $[\epsilon_{\rm lo}, \epsilon_{\rm hi}]$. Tokens whose ratio falls outside the range still contribute to the loss, but with a bounded importance weight.
    • Masked-IS (MIS) or IcePop12: rather than clamping, completely mask out any token whose importance ratio falls outside the allowed range. Masked tokens contribute no gradient at all, preventing extreme-ratio tokens from corrupting the update.
  • Adjusting the learning rate: high learning rates can destabilize gradient descent through “overshooting”25. Our naive baseline used a learning rate of 1e-5, following the Thinking Machines blog, which is substantially higher than common RLVR defaults (around 1e-6). We therefore conducted an ablation run with a learning rate of 1e-6.

Unless otherwise stated, we use the same rollout-ratio thresholds for both methods: \(\epsilon_{\rm lo}=0.5\) and \(\epsilon_{\rm hi}=2.0\). For TIS this means clamping to \([0.5, 2.0]\), while for MIS this means masking tokens outside \([0.5, 2.0]\).
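The two corrections can be sketched in a few lines. A minimal numpy illustration, assuming `ratio` holds the per-token rollout/learner probability ratios; the function names are ours, not veRL's.

```python
import numpy as np

def tis_weights(ratio, eps_lo=0.5, eps_hi=2.0):
    """Truncated-IS: clamp each per-token rollout/learner importance
    ratio into [eps_lo, eps_hi]; out-of-range tokens still contribute,
    but with a bounded weight."""
    return np.clip(ratio, eps_lo, eps_hi)

def mis_weights(ratio, eps_lo=0.5, eps_hi=2.0):
    """Masked-IS (IcePop): zero out tokens whose ratio leaves the
    range, so they contribute no gradient at all."""
    keep = (ratio >= eps_lo) & (ratio <= eps_hi)
    return np.where(keep, ratio, 0.0)
```

For ratios [0.9, 3.0, 0.1, 1.0], TIS yields [0.9, 2.0, 0.5, 1.0] while MIS yields [0.9, 0.0, 0.0, 1.0]; the resulting weights then scale each token's policy-gradient term.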

Results and Findings

  • Naive (LR 1e-5, no IS correction): large gradient norm spikes and growing rollout/learner mismatch (PPL drift). The logged Train KL even goes negative. Eval performance is noisy and fails to reliably improve.
  • Lower LR (1e-6): much steadier gradients with fewer grad norm spikes. Eval performance stays stable, but learning is slower and plateaus below Masked-IS.
  • Truncated-IS (TIS): looks stable early, but later the mismatch (PPL drift) explodes, gradients blow up, and eval performance collapses to zero near the end.
  • Masked-IS (MIS, Default): smooth training dynamics (PPL drift / grad norm / entropy), consistently low Train KL, and the strongest eval performance. With the runs now stabilized, we adopt this configuration as the default for subsequent experiments.

Key Insight

Even with a single update per rollout batch, the off-policy mismatch between the rollout and learner can snowball into unstable gradients and late-run collapse. Masked-IS/IcePop keeps the policy-gradient estimator reliable, as evidenced by stability metrics throughout training, and consistently achieves the strongest evaluation performance. It will therefore be our default going forward.

Detailed hyperparameters for this default run are listed in the Hyperparameter Analysis section.

Figure 2. Stability Diagnostics & Outcomes

Overlay runs with the toggles and switch metrics with the tabs. Clicking a run name in the results list resets to a head‑to‑head comparison against MIS (Default).

Top plot: stability diagnostics over training (gradient norm, rollout/learner PPL drift, entropy). Bottom plot: outcomes over training (eval performance mean@8, train KL to teacher).

Stability diagnostics
Gradient Norm: Tracks the magnitude of gradients. Spikes indicate unstable updates (often from rollout/learner mismatch and large importance weights).
Outcomes
Eval Performance: Out-of-distribution performance measured as mean reward (mean@8) on validation tasks. [EMA smoothed, α=0.15]
§ 4

Hyperparameter Analysis

After stabilizing the OPD runs in the previous section with MIS rollout correction, we now take a closer look at how key hyperparameters affect eval performance in OPD training.

Hyperparameters

The default refers to the baseline setting used for all runs throughout this blog: the Masked-IS configuration from §3 with the hyperparameters listed below (click to expand).

Table 1. Hyperparameter defaults

Core
  Teacher model: Qwen/Qwen3-4B
  Student model: Qwen/Qwen3-4B-Base
  Train dataset: deepmath
  Max response length: 8192
  Advantage type: k1
  Temperature: 1.0
  Gamma: 0.0
  KL coef: 0.0
  Samples per prompt (N): 4
  Total epochs: 1

Evaluation
  Eval benchmarks: brumo25 / hmmt25 / beyondaime / aime24 / aime25
  Temperature: 0.7
  Top-p: 0.9
  K: 8

Optimization
  Optimizer: AdamW
  LR: 1e-5
  Scheduler: constant
  Betas: [0.9, 0.95]
  Weight decay: 0.0
  Entropy coef: 0.0
  Grad clip: 1.0
  Batch size: 512
  PPO mini batch size: 512

Rollout correction
  Rollout RS: token
  Rollout RS threshold: 2.0
  Rollout RS threshold lower: 0.5

Hyperparameter Sweeps

We select key hyperparameters for OPD and sweep them across values to compare them against the default run. Note that moving a second slider automatically resets the others back to default hyperparameters. We leave the effects of hyperparameter combinations to future work.

Swept hyperparameters and values (* marks the default):
  • Epochs: 1*, 2
  • Group size: 1, 4*, 8
  • Response length: 4k, 8k*, 16k
  • Entropy bonus: 0.0*, 0.01
  • Teacher size: Qwen3-4B*, Qwen3-8B, Qwen3-14B, Qwen3-32B
  • Gamma: 0.0*, 0.1, 0.5
  • Temperature: 0.1, 0.5, 1.0*, 1.2
  • Advantage whitening: no whitening, sample-level, group-level, batch-level*
Metric tabs: mean@8 (avg), hmmt25, brumo25, beyondaime, aime24, aime25.
Figure 3. Interactive hyperparameter sweep explorer. Left: sliders select one hyperparameter axis at a time. Right: per-benchmark eval scores at peak training step. Select either mean@8 or best@8. Moving a slider resets others to default.

Hyperparameter Sweep Takeaways

  1. Response length matters a lot for eval performance: longer responses are consistently better, which suggests a test-time-scaling effect, aligning with most RLVR results.
  2. Running multiple epochs over the same dataset improves performance, suggesting the model may still be undertrained and could benefit from more data.
  3. Larger teacher models did not help much in the current setup. One plausible reason is that these Qwen3 teachers are themselves distilled from larger upstream teachers and may have similar output distributions14. The 4B teacher is more “on-policy” than the 8B and 14B teachers, which could explain the better performance. Furthermore, since the 8B and 14B models were themselves distilled from the 32B model, the 32B may be the better representative distribution to distill from.
  4. Larger group sizes did not help much here, which is somewhat surprising given that scaling group size often improves performance in RLVR15.
§ 5

Divergence as Advantage Shaping

In on-policy distillation, the divergence to the teacher is the only learning signal. In this section, we explore whether changing the default divergence from reverse KL to other forms leads to better learning outcomes.

Naive k1 vs k3

We start with a small pilot run: keep the setup fixed and change the per-token advantage form from k1 (default) to k316, a lower-variance estimator of the reverse KL. The two plots below show the train KL and the average mean@8 across eval benchmarks over training. For k3, we observe a slower increase in eval performance and an unstable KL, followed by an eventual collapse.


Figure 4. k1 vs naive k3 pilot. Left: eval performance over training. Right: train KL estimator. Naive k3 produces biased gradients, leading to worse eval performance and persistently higher KL.

What went wrong?

Naively using divergence values as policy-gradient weights introduces bias in on-policy distillation when the advantage depends on the current policy. While k3 provides an unbiased estimate of the KL value, it becomes biased when used as the advantage in the policy-gradient. In this pilot experiment, this bias leads to substantially higher training KL and significantly worse eval performance compared to k1.

The missing piece is the correction term that arises from differentiating through the ratio \( r = \pi_\theta / \pi_{\rm T} \), i.e., accounting for the \(\theta\)-dependence inside the integrand11. Next, we derive the unbiased weighting for OPD policy-gradients under general f-divergence distillation objectives and show the correct policy-gradient form.

Derivation of Divergence Advantages

Using the raw divergence value directly as a REINFORCE weight drops the reward-gradient correction and gives a biased gradient.

Compact derivation. We start from an f-divergence objective. Here \(D_f\) is generated by a convex function \(f:\mathbb{R}_{+}\!\to\!\mathbb{R}\) with \(f(1)=0\). The default reverse KL used in the main experiments is the special case \(f(r)=r\log r\): \[ D_f(\pi_\theta \,\|\, \pi_{\rm T}) = \mathbb{E}_{y\sim \pi_{\rm T}}\!\left[f(r_\theta(y))\right], \qquad r_\theta(y):=\frac{\pi_\theta(y)}{\pi_{\rm T}(y)}. \] Since the teacher policy \(\pi_{\rm T}\) is fixed with respect to \(\theta\), only \(\pi_\theta\) contributes to the derivative inside \(r_\theta(y)\). The key identity is \[ \nabla_\theta r_\theta(y) = \frac{\nabla_\theta \pi_\theta(y)}{\pi_{\rm T}(y)} = r_\theta(y)\,\nabla_\theta \log \pi_\theta(y). \] Plugging this in gives the gradient:

$$ \begin{aligned} \nabla_\theta D_f &= \mathbb{E}_{y\sim\pi_{\rm T}}\!\big[\nabla_\theta f(r_\theta(y))\big] && \text{(fixed }\pi_{\rm T}\text{)} \\[4pt] &= \mathbb{E}_{y\sim\pi_{\rm T}}\!\big[f'(r_\theta(y))\,\nabla_\theta r_\theta(y)\big] && \text{(chain rule)} \\[4pt] &= \mathbb{E}_{y\sim\pi_{\rm T}}\!\big[f'(r_\theta(y))\,r_\theta(y)\,\nabla_\theta\log\pi_\theta(y)\big] && \text{(use above identity)} \\[4pt] &= \mathbb{E}_{y\sim\pi_\theta}\!\big[f'(r_\theta(y))\,\nabla_\theta\log\pi_\theta(y)\big] && \text{(importance sampling)}. \end{aligned} $$
(4)

This shows that the correct policy-gradient weight is \(f'(r_\theta(y))\). The full derivation below reproduces the same result step by step and shows how it relates to the alternative form \(h(r_\theta(y))+r_\theta(y)h'(r_\theta(y))\), where \(h(r):=f(r)/r\) (click to expand).
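For the reverse-KL generator, this weight is easy to sanity-check numerically. A small sketch with helper names of our own, including a finite-difference check of \(f'(r)\):

```python
import numpy as np

def adv_reverse_kl(r):
    """Correct OPD advantage -f'(r) for f(r) = r*log(r): -(1 + log r)."""
    return -(1.0 + np.log(r))

# Finite-difference sanity check of f'(r) at r = 1.5:
f = lambda r: r * np.log(r)
eps = 1e-6
numeric_fprime = (f(1.5 + eps) - f(1.5 - eps)) / (2 * eps)
assert np.isclose(-adv_reverse_kl(1.5), numeric_fprime)
```

Note that at \(r = 1\) the weight is \(-1\) rather than 0; as discussed below, this constant offset is harmless inside the policy-gradient expectation.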

Full derivation (click to expand)

Step 1. Classic REINFORCE review. The standard policy-gradient optimizes an expected return \(J(\theta)=\mathbb{E}_{y\sim\pi_\theta}[G(y)]\). Applying the log-derivative trick \(\nabla_\theta\pi_\theta(y)=\pi_\theta(y)\nabla_\theta\log\pi_\theta(y)\), we obtain

$$ \nabla_\theta J(\theta) = \nabla_\theta\,\mathbb{E}_{y\sim\pi_\theta}[G(y)] = \mathbb{E}_{y\sim\pi_\theta}\!\big[G(y)\,\nabla_\theta\log\pi_\theta(y)\big]. $$
(5)

The key assumption is that \(G(y)\) does not depend on \(\theta\). It is a fixed scalar observed after the rollout, such as a game score or task reward. Under that assumption, only the sampling distribution \(\pi_\theta\) carries the \(\theta\)-dependence, giving the clean single-term result above.

Step 2. Distillation changes this because the “reward” depends on \(\theta\). In on-policy distillation we want to minimize an f-divergence between the student policy \(\pi_\theta\) and a teacher policy \(\pi_{\rm T}\):

$$ D_f(\pi_\theta \,\|\, \pi_{\rm T}) = \mathbb{E}_{y \sim \pi_{\rm T}}\!\left[f(r_\theta(y))\right], \qquad r_\theta(y):=\frac{\pi_\theta(y)}{\pi_{\rm T}(y)}. $$
(6)

Since \(r_\theta(y)\) depends on \(\theta\) through \(\pi_\theta(y)\), so does \(f(r_\theta(y))\). To rewrite the objective using on-policy samples from \(\pi_\theta\), we apply importance sampling:

$$ \begin{aligned} D_f(\pi_\theta\|\pi_{\rm T}) &= \mathbb{E}_{y\sim\pi_{\rm T}}\!\big[f(r_\theta(y))\big] = \int \pi_{\rm T}(y)\,f(r_\theta(y))\,dy \\[4pt] &= \int \pi_\theta(y)\cdot\frac{f(r_\theta(y))}{r_\theta(y)}\,dy = \mathbb{E}_{y\sim\pi_\theta}\!\big[h(r_\theta(y))\big], \qquad h(r):=\frac{f(r)}{r}. \end{aligned} $$
(7)

This resembles an expected-reward objective \(\mathbb{E}_{y\sim\pi_\theta}[G_\theta(y)]\), but now the effective “reward” \(G_\theta(y)=-h(r_\theta(y))\) is not fixed: it depends on \(\theta\) through \(r_\theta(y)=\pi_\theta(y)/\pi_{\rm T}(y)\). Naively applying REINFORCE would ignore the derivative of the integrand \(h(r_\theta(y))\) with respect to \(\theta\), and therefore produce a biased estimator of the true gradient.
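The importance-sampling step in Eq. (7) can be verified exactly on a toy vocabulary, where both expectations are finite sums; the distributions below are illustrative only.

```python
import numpy as np

# Exact check of Eq. (7) on a toy 4-token vocabulary: the teacher-side
# expectation E_{pi_T}[f(r)] equals the student-side E_{pi_theta}[h(r)]
# with h(r) = f(r)/r.
pi_t = np.array([0.1, 0.2, 0.3, 0.4])   # teacher pi_T
pi_s = np.array([0.4, 0.3, 0.2, 0.1])   # student pi_theta
r = pi_s / pi_t

f = r * np.log(r)   # reverse-KL generator f(r) = r log r
h = f / r           # h(r) = f(r)/r = log r

lhs = float(np.sum(pi_t * f))   # E_{pi_T}[f(r)]
rhs = float(np.sum(pi_s * h))   # E_{pi_theta}[h(r)]
assert np.isclose(lhs, rhs)     # both equal KL(pi_theta || pi_T)
```

Both sums evaluate to the same positive number, the reverse KL between the toy student and teacher.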

Step 3. Apply the product rule because both the sampler and the integrand depend on \(\theta\). Writing the expectation explicitly, \(\theta\) appears in two places: in the sampling distribution \(\pi_\theta\) and in the integrand \(h(r_\theta(y))\).

$$ \begin{aligned} \nabla_\theta D_f &= \nabla_\theta \int \pi_\theta(y)\,h(r_\theta(y))\,dy \\[4pt] &= \int \underbrace{\nabla_\theta\pi_\theta(y)}_{\pi_\theta(y)\,\nabla_\theta\log\pi_\theta(y)}\,h(r_\theta(y))\,dy \;+\; \int \pi_\theta(y)\,\underbrace{\nabla_\theta h(r_\theta(y))}_{\text{chain rule through }r_\theta(y)}\,dy. \end{aligned} $$
(8)

For the second term, apply the chain rule:

$$ \nabla_\theta h(r_\theta(y)) = h'(r_\theta(y))\,\nabla_\theta r_\theta(y). $$

Substituting into the previous expression gives

$$ \begin{aligned} \nabla_\theta D_f &= \mathbb{E}_{y\sim\pi_\theta}\!\Big[h(r_\theta(y))\,\nabla_\theta\log\pi_\theta(y)\Big] + \mathbb{E}_{y\sim\pi_\theta}\!\Big[h'(r_\theta(y))\,r_\theta(y)\,\nabla_\theta\log\pi_\theta(y)\Big] \\[4pt] &= \mathbb{E}_{y\sim\pi_\theta}\!\Big[\big(h(r_\theta(y))+r_\theta(y)\,h'(r_\theta(y))\big)\,\nabla_\theta\log\pi_\theta(y)\Big]. \end{aligned} $$
(9)

Using \(f(r)=r\,h(r)\), we have

$$ f'(r)=h(r)+r\,h'(r), $$

so the final result can be written in the compact form

$$ \nabla_\theta D_f = \mathbb{E}_{y\sim\pi_\theta}\!\big[f'(r_\theta(y))\,\nabla_\theta\log\pi_\theta(y)\big]. $$
(10)

Therefore, the correct policy-gradient weight for minimizing the divergence is \(f'(r_\theta(y))\); equivalently, the corresponding advantage or policy-gradient weight for OPD is \(-f'(r_\theta(y))=-h(r_\theta(y))-r_\theta(y)h'(r_\theta(y))\). Naively using only the value term \(h(r_\theta(y))\) would ignore the reward-gradient term and thus produce a biased gradient estimator.

Why k1 Is Special

Going back to our pilot experiment, for reverse KL, with \( r = \pi_\theta / \pi_{\rm T} \), the generator is \( f(r)=r\log r \), so the policy-gradient weight is $-f'(r) = -(1 + \log r)$.

This differs from the Algorithm 1 advantage, \( -\log r \), only by the constant \( -1 \), which vanishes inside \( \mathbb{E}[\cdot \nabla_\theta \log \pi_\theta] \) because the score function has zero mean. So for reverse KL specifically, the naive k1 advantage \( -\log r \) used in Section 2 is unbiased.
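The zero-mean property of the score function is easy to confirm on a toy softmax (values below are illustrative):

```python
import numpy as np

# For pi = softmax(theta), the score of a sampled token y is
# grad_theta log pi(y) = onehot(y) - pi, and its mean under pi is
# exactly zero, so any constant advantage c contributes
# c * E_y[onehot(y) - pi] = 0 to the expected gradient.
theta = np.array([0.3, -1.2, 0.8])
pi = np.exp(theta) / np.exp(theta).sum()

mean_score = sum(pi[y] * (np.eye(3)[y] - pi) for y in range(3))
assert np.allclose(mean_score, 0.0)
```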

Why k3 is biased

The Schulman k3 estimator16 for the reverse-KL value is \( \frac{1}{r}-1+\log r \). Naively turning that value estimate into an advantage gives: $A^{\mathrm{naive\text{-}k3}}(r) = -\frac{1}{r} + 1 - \log r.$

The correct policy-gradient weight for reverse-KL training is instead \( A^{\mathrm{correct}}(r) = -f'(r)=-(1+\log r) \), so their difference is:

$$A^{\mathrm{naive\text{-}k3}}(r) - A^{\mathrm{correct}}(r) = 2 - \frac{1}{r}.$$
(11)

Because this difference depends on \(r\), it is not a constant and does not vanish under the expectation, introducing bias into the resulting policy-gradient estimator. In general, an estimator that is unbiased for an f-divergence value is not automatically unbiased when used as a policy-gradient weight11.
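This bias is easy to exhibit numerically. On a single toy softmax step, the \(r\)-dependent gap of Eq. (11) injects an extra expected-gradient term that works out in closed form to exactly \(\pi_\theta - \pi_{\rm T}\) (toy distributions; the closed form is our own small calculation, not from the post):

```python
import numpy as np

pi_t = np.array([0.1, 0.2, 0.3, 0.4])   # teacher
pi_s = np.array([0.4, 0.3, 0.2, 0.1])   # student (toy values)
gap = 2.0 - pi_t / pi_s                  # Eq. (11): 2 - 1/r per token

# Expected extra gradient the gap injects, using score onehot(y) - pi_s:
bias = sum(pi_s[y] * gap[y] * (np.eye(4)[y] - pi_s) for y in range(4))

# The constant part (the 2) averages out, but the -1/r part does not:
# the bias equals pi_s - pi_t, vanishing only when student == teacher.
assert np.allclose(bias, pi_s - pi_t)
assert not np.allclose(bias, 0.0)
```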

Divergence Types

With the properly derived advantage for general f-divergences, we examine three representative divergences (Jensen, Hellinger, Tsallis) and illustrate their corresponding advantage forms, \( -f'(r) \), below. We also plot the distribution of importance ratios \( r_\theta \) observed in actual responses before any training (Qwen3-4B-Base vs. Qwen3-4B), revealing which weighting regimes dominate early training dynamics.

Figure 5. Importance Ratio vs. OPD Advantage

X-axis: importance ratio \(r_\theta=\pi_\theta/\pi_{\rm T}\). Curves: per-token policy-gradient weights (correct training weight uses \(-f'(r)\), while naive value-form uses \(-h(r)=-f(r)/r\)). We also show the distribution of \(r_\theta\), highlighting which weighting regime dominates training.

Tip: click legend entries to toggle curves.

Divergence Shaping Results

Using the derived policy-gradient weight -f'(r), we compare the average mean@8 of multiple f-divergences (Jensen, Hellinger, Tsallis) against the default (k1) in the bar chart below.

Figure 6. Divergence Shaping vs Default

Compare each divergence reward variant against the default run. Select a metric tab to view the max-over-training score for that benchmark.

Bars use max-over-training mean@8 for the selected benchmark metric, displayed as percentages. The dashed horizontal line marks the Default run.

Findings: Divergence Shaping vs Default

  • The naive k3 pilot run is clearly unstable relative to k1: lower eval performance and persistently higher train KL, because its gradient is biased.
  • With the derived advantage -f'(r) weighting, Hellinger (+0.3%) and Tsallis (+0.3%) slightly improve overall AVG mean@8 over default, while Jensen is 0.5% below.
§ 6

Teacher Ensemble

Instead of a single teacher policy, we can define $\pi_{\rm T}$ as an ensemble or a mixture of teachers. For example, we can have multiple teacher models of different sizes or from different model families, and combine their outputs to define the reference distribution. This raises the question: how does the choice of aggregation strategy affect eval performance? In this section, we use Qwen3-4B, Qwen3-8B, Qwen3-14B, and Qwen3-32B as the ensemble of teachers, and leave teachers from different model families for future work.

Ensemble Strategies

We explore several strategies for aggregating the outputs of teacher models:

  • Arithmetic mean: $\pi_{\rm T}=\frac{1}{K}\sum_k \pi_k$ (probability-space averaging).
  • Geometric mean: $\pi_{\rm T}\propto \prod_k \pi_k^{w_k}$ (logit-space averaging).
  • Hard switch: pick a single teacher per query sample. As training progresses, we switch to the next larger teacher, similar to a form of curriculum learning.
  • Soft switch: the reference distribution is a linearly weighted combination of “neighboring” teachers. We gradually shift the weights from smaller to larger teachers over time.
  • Control-token routing: condition the teacher choice on a control prefix token before generation. We split each batch evenly across teachers (e.g., 25% each for Qwen3-4B, Qwen3-8B, Qwen3-14B, and Qwen3-32B).
  • Random: pick a teacher at random for each query sample.
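The two averaging strategies can be sketched directly. A minimal numpy illustration (helper names are ours; a real implementation would operate on full per-position vocabulary distributions):

```python
import numpy as np

def arithmetic_mean(teacher_probs):
    """Probability-space averaging: pi_T = (1/K) * sum_k pi_k."""
    return np.mean(teacher_probs, axis=0)

def geometric_mean(teacher_probs, weights=None):
    """Logit-space averaging: pi_T proportional to prod_k pi_k^{w_k},
    renormalized over the vocabulary (uniform w_k by default)."""
    k = len(teacher_probs)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights)
    log_mix = np.tensordot(w, np.log(teacher_probs), axes=1)
    p = np.exp(log_mix - log_mix.max())   # subtract max for stability
    return p / p.sum()
```

For two teachers over a 3-token vocabulary, `[[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]`, the arithmetic mean is [0.35, 0.3, 0.35], while the geometric mean is a renormalized distribution that also stays symmetric in the first and last tokens.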

The animation below shows how each ensemble strategy computes the policy-gradient advantage token by token. The student first generates an on-policy response on the left. Some methods depend on training progress, represented as a ratio in [0,1], where 0 indicates the start of training and 1 indicates its completion (illustrated in the left panel). In the middle, the chosen teachers evaluate the sampled token, where $W$ denotes the weights used to linearly combine the teacher advantages. Finally, the aggregation method combines these evaluations into a single reference score on the right.

Switch between aggregation methods using the tabs below to see how the routing and weighting change. Press the replay button to restart the animation, or the step button to advance token by token and inspect the intermediate values.

Figure 7. Ensemble Strategies Animation

[Interactive animation. Panels: Student Sampler (current token, p_student(token), training progress in [0, 1], and a control token such as <|Teacher-1|>); Teacher Log-Prob Evaluation; Reference Aggregation, e.g. p_ref(token) = mean_k p_k(token), giving Advantage = log p_ref(token) - log p_student(token); and a Token-Level PG Vector View of log p_student(token) × advantage.]

The student samples token-by-token; each method changes how teacher scores are routed into p_ref and the resulting advantage.

Ensemble Results

We train each ensemble strategy with other hyperparameters fixed to the same default (no ensemble) setup and compare eval performance curves over training.

Figure 8. Learning Curves for Ensemble Strategies

EMA-smoothed eval performance curves over training for each ensemble strategy. Includes the Default run for direct comparison.

(Curves are EMA-smoothed with α = 0.15; tabs switch between eval performance in mean@8 and best@8.)

Findings: Ensemble Strategies

  • Hard switch (curriculum-based teacher scheduling) is the most effective strategy, achieving +1.3% mean@8 and +3.6% best@8 over the default by the end of training. Gradually transitioning from a weaker to a stronger teacher provides the student with a natural curriculum.
  • Control-token routing performs roughly on par with the baseline. The idea is similar to Upside Down RL18 and Decision Transformer17, conditioning generation on a control prefix that selects the teacher. However, the prefix token introduces distribution shift relative to the base model, which can destabilize early training. In principle, control-token routing offers significant flexibility for teacher selection and conditional behaviors. One possible improvement would be to run an SFT warmup stage before switching to OPD, allowing the student to learn the control-token semantics before on-policy training begins.
  • The remaining strategies (soft switch, arithmetic mean, geometric mean) perform roughly on par with the default. In contrast, random teacher performs the worst (-1.6% mean@8 vs. default), which is expected since randomly switching teachers per sample makes the reward signal more stochastic and may destabilize training.
§ 7

Token-wise Filtering

Recent work on RLVR19 shows that high-entropy "forking tokens," roughly 20% of tokens in a chain of thought, drive most of the learning signal. Motivated by this finding, we explore whether this insight transfers to on-policy distillation: can we improve training by selectively filtering which tokens receive gradient updates?

Filtering Criteria

We test three token-level scoring functions to rank tokens, then only apply gradient updates to tokens within a chosen percentile range:

  • KL divergence: focus on tokens where the student-teacher disagreement is smallest/largest.
  • Student entropy: focus on tokens where the student is most certain/uncertain, analogous to the "forking tokens" identified in RLVR.
  • Teacher entropy: focus on tokens where the teacher itself is certain/uncertain.

Formally, let $s_t$ be the token-level score under the chosen criterion and $m_t \in \{0,1\}$ indicate whether token $t$ falls inside the selected percentile range. The filtered advantage is:

$$m_t=\mathbf{1}\!\left[p_{\min} \le \operatorname{percentile}(s_t) \le p_{\max}\right],\qquad A_t^{\text{filtered}} = m_t \cdot A_t.$$
(12)

This concentrates the gradient budget on selected tokens based on heuristic rules. We sweep the percentile ranges (0-30, 30-70, 70-100) for each criterion. The interactive visualization below shows which tokens are selected under each criterion and range for a sample sequence. Move the P_MIN and P_MAX sliders to see how the tokens are filtered, and choose the filtering criterion at the top left to explore different strategies.
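A minimal numpy sketch of Eq. (12)'s percentile mask (helper names are ours):

```python
import numpy as np

def percentile_mask(scores, p_min, p_max):
    """m_t from Eq. (12): 1 if token t's score lies within the
    [p_min, p_max] percentile band of the batch, else 0."""
    lo, hi = np.percentile(scores, [p_min, p_max])
    return ((scores >= lo) & (scores <= hi)).astype(float)

def filtered_advantage(adv, scores, p_min=70, p_max=100):
    """Zero the advantage of masked-out tokens; e.g. KL 70-100 keeps
    only the top-30%-disagreement tokens."""
    return percentile_mask(scores, p_min, p_max) * adv
```

For ten tokens with scores 0 through 9, the 70-100 band retains the three highest-scoring tokens and zeroes the advantages of the rest.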

Figure 9. Interactive token filtering visualization. Select a filtering criterion and percentile range to see which tokens are masked (blacked out in the last row) vs. retained for gradient updates.

The table below summarizes all token-filtering runs, showing both peak and end-of-training outcomes so we can compare raw gains versus training stability.

Table 2. Filtering Methods Results

mean@8/best@8 are shown as both max-over-training and final-step values, while train KL/entropy use final-step values.

Columns: Run | Mean@8 (max) ↑ | Best@8 (max) ↑ | Mean@8 (last) ↑ | Best@8 (last) ↑ | Train KL (last) | Entropy (last)

Findings: Token-wise Filtering

  • KL 70-100 is the best filter: focusing gradient updates on the highest-KL tokens (top 30%) matches the default on peak mean@8 (21.38% vs 21.44%) while showing markedly better end-of-training performance: +1.8% mean@8 (20.27% vs 18.43%) and +1.4% best@8 (33.83% vs 32.44%) at the final checkpoint. It also achieves the highest best@8 over the full run at 37.50% (+1.7% over default).
  • Low-percentile filtering is catastrophic. KL 0-30 (8.93%), KL 30-70 (9.68%), and student entropy 0-30 (10.53%) all experience performance collapse. Training only on tokens where the student already agrees with the teacher (low KL) or is already confident (low entropy) provides almost no learning signal.
  • Teacher entropy shows an inverted pattern: low teacher entropy (0-30) reaches 18.56% mean@8, while high teacher entropy (70-100) collapses to 11.07%. Tokens where the teacher is uncertain provide poor supervision, which is quite intuitive.
§ 8

Conclusion & Future Work

Conclusion

Through empirical exploration of on-policy distillation, we arrive at several key takeaways.

  • Stability is the first bottleneck. The naive baseline runs into stability issues, with per-token k1 KL-estimator values going negative. Masked importance sampling (MIS / IcePop) resolves this and is a prerequisite for all subsequent experiments.
  • Hyperparameters are high-leverage. Max response length (the higher the better), rollout temperature (best at 1.0), and number of epochs (the more the better) strongly affect eval performance.
  • Divergence choice has reward-shaping effects. Replacing KL with alternative f-divergences (Hellinger, Tsallis) produces modest gains (+0.3% mean@8), suggesting that the shape of the reward function does influence optimization.
  • Curriculum-based teacher scheduling is the biggest win. A hard switch from a weaker to a stronger teacher yields +1.3% mean@8 and +3.6% best@8 over the default by end of training.
  • Token filtering improves end-of-training performance. Concentrating gradient updates on high-KL tokens (top 30%) improves end-of-training mean@8 by +1.8% and achieves the best best@8 of 37.50% (+1.7% over default).
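The stability fix in the first takeaway can be sketched as follows. This is a minimal illustration in the spirit of IcePop-style masked importance sampling (with hypothetical variable names and a placeholder clip constant), not the exact implementation used in our runs:

```python
import numpy as np

def mis_token_mask(logp_rollout, logp_learner, c=2.0):
    """Masked importance sampling: drop tokens whose rollout-engine vs.
    learner probability ratio falls outside [1/c, c]. Tokens with a large
    rollout/learner mismatch inject noisy, high-variance gradient terms
    (including negative per-token k1 KL estimates), so masking them out
    stabilizes training."""
    ratio = np.exp(np.asarray(logp_rollout) - np.asarray(logp_learner))
    return (ratio >= 1.0 / c) & (ratio <= c)

# Token 2 drifted far from the learner (ratio ~ exp(2) > c) and is masked.
mask = mis_token_mask([-1.0, -2.0, -0.5], [-1.1, -4.0, -0.6])
```

The resulting boolean mask is simply multiplied into the per-token distillation loss before reduction.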

Limitations

  • Domain scope is narrow. The experiments focus on DeepMath-103K prompts and math evaluations, so results may not transfer directly to coding, tool use, or broader instruction-following tasks.
  • Model-family/size scope is limited. Most comparisons use Qwen3-4B teacher-student variants, so cross-family and student size behavior remains untested.
  • Statistical confidence is limited. The runs are primarily exploratory comparisons and do not yet include multi-seed confidence intervals.

Future Work

  • SFT-then-OPD: run a supervised fine-tuning warmup to bring the student closer to the teacher policy. This may be particularly beneficial for the control-token ensemble method, since the control-token prefix introduces distribution shift.
  • More diverse settings: we only investigated mathematical reasoning and need to expand to coding, tool-use, and agentic scenarios.
  • Self-context distillation: recent work on self-distillation20212223 shows that a single model can act as both teacher and student by conditioning on privileged information (verified reasoning traces, environment feedback, or demonstrations). On-policy distillation is a key step in these frameworks. Investigating how conditioning on privileged context interacts with the techniques explored in this blog (divergence shaping, token filtering, curriculum scheduling) is a promising direction for future work.
§ 9

Acknowledgements

This research was made possible by Block, which generously sponsored all GPU and compute resources. All experiments were conducted on GB200 clusters.

Thanks to the team at Thinking Machines, and in particular Kevin Lu, for project discussions, and to the original OPD blog for inspiration.

The Stack

Hardware: GB200 Nodes
Architecture: ARM (aarch64)
GPUs/Node: 4× B200
CPUs/Node: 144 cores
Memory/Node: ~1.5 TB

References

Citation and source links

Citation

@misc{zhao2026opd,
  title = {On-Policy Distillation Adventures},
  author = {Zhao, Andrew},
  howpublished = {Andrew Zhao's Blogs},
  year = {2026},
  url = {https://andrewzh112.github.io/opd_adventures}
}

Papers and References

  1. Lu, K., & Thinking Machines Lab. (2025). On-Policy Distillation. Technical blog post.
  2. Schulman, J., & Thinking Machines Lab. (2025). LoRA Without Regret. Technical blog post.
  3. Yao, F., et al. (2025). Your Efficient RL Framework Secretly Brings You Off-Policy RL Training. Technical blog post.
  4. Anthropic. (2025). Claude 3.7 Sonnet and Claude Code. Product announcement blog post.
  5. OpenAI. (2025). Introducing Codex. Product announcement blog post.
  6. Agarwal, R., Vieillard, N., et al. (2024). On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. ICLR 2024.
  7. Chu, T., Zhai, Y., et al. (2025). SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. ICML 2025.
  8. Shenfeld, I., Pari, J., et al. (2025). RL's Razor: Why Online Reinforcement Learning Forgets Less. ICLR 2025.
  9. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv preprint.
  10. Shao, Z., Wang, P., Zhu, Q., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint.
  11. Shah, V., Obando-Ceron, J., Jain, V., et al. (2025). A Comedy of Estimators: On KL Regularization in RL Training of LLMs. arXiv preprint.
  12. Ling Team, et al. (2025). Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model. arXiv preprint.
  13. Liu, J., Li, Y., et al. (2025). When Speed Kills Stability: Demystifying RL Collapse from the Training-Inference Mismatch. Technical blog post.
  14. Qwen Team. (2025). Qwen3 Technical Report. arXiv preprint.
  15. Hu, J., et al. (2025). BroRL: Scaling Reinforcement Learning via Broadened Exploration. arXiv preprint.
  16. Schulman, J. (2020). Approximating KL Divergence. Blog post.
  17. Chen, L., Lu, K., et al. (2021). Decision Transformer: Reinforcement Learning via Sequence Modeling. NeurIPS 2021.
  18. Schmidhuber, J. (2019). Reinforcement Learning Upside Down: Don't Predict Rewards -- Just Map Them to Actions. arXiv preprint.
  19. Wang, S., et al. (2025). Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning. NeurIPS 2025.
  20. Zhao, S., et al. (2026). Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models. arXiv preprint.
  21. Hübotter, J., et al. (2026). Reinforcement Learning via Self-Distillation. arXiv preprint.
  22. Shenfeld, I., et al. (2026). Self-Distillation Enables Continual Learning. arXiv preprint.
  23. Shi, T., et al. (2026). Experiential Reinforcement Learning. arXiv preprint.
  24. Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
  25. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Appendix

Logs

Supplementary logs and diagnostics

Select runs from the exported logs. The default view starts with Default.

The viewer exposes per-benchmark accuracy panels (Average Across Benchmarks, HMMT 2025, Brumo 2025, BeyondAIME, AIME 2024, AIME 2025) alongside training diagnostics (Teacher KL Divergence, Gradient Norm, Policy Entropy, Mean Response Length, Rollout/Learner Log-PPL Drift, Rollout RS Masked Fraction, Rollout/Learner PPL Ratio).