Request for training recipe of on-policy KD
Hello! Thank you for the excellent work. I’m trying to reproduce Table 1 from the post (https://github.com/NVIDIA-NeMo/RL/discussions/1445) and would be very grateful for any guidance. A few targeted questions:
- Could you share the training recipe used for on-policy KD in the post (Qwen3-32B teacher, Qwen3-4B student)? How many GPUs did you use?
- If it's shareable, could you also share the training logs? I'd like to refer to the training curves of the losses and evaluation metrics while reproducing the experiment; it would be a great help.
- Could you share some practical tips for on-policy KD training? For example:
  - Sensitivity to the learning rate
  - KL loss type (mixed vs. reverse, KL weight)
  - The appropriate performance gap between student and teacher
  - Tricks to avoid model collapse
Sorry for the many questions, and thanks again for sharing this great work with the open-source community!
Best regards,
Jihwan
Thanks for your interest, @JihwanEom. Since some experiments in the blog were conducted on an early version of the PR, we need some time to organize our recipe before sharing it with you.
For part 3, I can give some quick answers based on our recent experiments:
- For the learning rate, 2e-5 works well in our experiments, while 4e-5 performs worse at the beginning but aligns with 2e-5 after about 100 steps.
- Reverse KL works better (see the sketch below).
- We haven't tried many student–teacher pairs, but for math tasks, using Qwen3-4B as the teacher can also improve the performance of the 1.7B-Base model.
- Top-k should be set appropriately, at least 64. A larger batch size, such as 512, seems to make training more stable.
But please note that all the tips come from our limited experimental observations, so they may not work in all scenarios.
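To make the reverse-KL point concrete, here is a minimal sketch (not the actual NeMo-RL loss implementation; the function and tensor names are illustrative) of what reverse-KL on-policy distillation computes on the student's own rollouts, restricted to the student's top-k candidates as in the top-k tip above:

```python
import torch
import torch.nn.functional as F

def reverse_kl_distillation_loss(student_logits, teacher_logits, response_mask, top_k=64):
    """Reverse KL, i.e. KL(student || teacher), on student-sampled rollouts.

    student_logits / teacher_logits: [batch, seq, vocab], both computed on the
    sequences the student generated; response_mask: 1 for generated tokens,
    0 for prompt tokens. Restricting to the student's top-k logits is an
    approximation that mirrors the top-k tip above.
    """
    topk_vals, topk_idx = student_logits.topk(top_k, dim=-1)
    teacher_topk = teacher_logits.gather(-1, topk_idx)

    student_logp = F.log_softmax(topk_vals, dim=-1)
    teacher_logp = F.log_softmax(teacher_topk, dim=-1)

    # KL(p_student || p_teacher) = sum_v p_student(v) * (log p_student(v) - log p_teacher(v))
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)

    # Average only over the generated (response) positions.
    return (kl * response_mask).sum() / response_mask.sum().clamp(min=1)
```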
Hello, @zpqiu, thanks for sharing these valuable insights! They’ll be hugely helpful for my experiments :D
I have a few follow-up questions:
Q1. Could you share the training recipe once you’ve finished organizing it? I agree it won’t be a definitive answer, but it would be a great baseline for improving my custom models.
Q2. You mentioned that a 2e-5 learning rate worked well—could you share the batch size you used with 2e-5?
Q3. Related to Q2: do you recommend a larger batch size (e.g., 512) for more stable training?
Q4. Do you have plans to support on-policy knowledge distillation for VLMs? To the best of my knowledge, this shouldn’t be too far from text-only KD since the difference is mainly conditioning on both text and image. My impression is that feeding both student and teacher the same image-text inputs and backpropagating the loss on the generated responses should suffice. Am I missing anything? Any advice for implementing VLM on-policy KD would be appreciated!
Q5. (Slightly tangential to the current topic :D) Do you plan to support a batch-invariance feature in vLLM (https://docs.vllm.ai/en/latest/features/batch_invariance/#batch-invariance) to enable “true” on-policy KD? Thanks again for all the help!
Best, Jihwan
Hi @JihwanEom, I have tested the following recipe:
```yaml
defaults: distillation_math.yaml
distillation:
  num_prompts_per_step: 512
  max_num_steps: 500
  val_batch_size: 512
  val_period: 20
loss_fn:
  kl_type: reverse
checkpointing:
  model_save_format: "torch_save"
  keep_top_k: 3
  checkpoint_dir: checkpoints/distillation-qwen3-32b-to-4b-base-long
policy:
  model_name: Qwen/Qwen3-4B-Base
  train_global_batch_size: 512
  max_total_sequence_length: 20480
  generation:
    vllm_cfg:
      tensor_parallel_size: 1
      gpu_memory_utilization: 0.7
teacher:
  model_name: Qwen/Qwen3-32B
  max_total_sequence_length: 20480
logger:
  log_dir: logs/distillation-qwen3-32b-to-4b-base-long
  wandb:
    project: nemo-rl
    name: distillation-qwen3-32b-to-4b-base-long
cluster:
  num_nodes: 2
```
The training curves look like this:
At step 60, the score on AIME 2024 can reach 37.29, and the score on AIME 2025 can reach 32.71. The evaluation settings are:
```
max_new_tokens=20480 temperature=0.6 top_p=1.0 top_k=-1 seed=42
metric=pass@1 num_tests_per_prompt=16
```
Therefore, the above configuration should roughly reproduce the results in the table of our blog, which can serve as a reference for you :)
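As an aside, `pass@1` with `num_tests_per_prompt=16` simply means sampling 16 completions per prompt, scoring each one, and averaging the per-prompt success rates. A tiny sketch (the data layout is illustrative only):

```python
def pass_at_1(per_prompt_results):
    """per_prompt_results: one list of booleans per prompt, one entry per
    sampled completion (16 per prompt in the setting above)."""
    rates = [sum(r) / len(r) for r in per_prompt_results]
    return sum(rates) / len(rates)

# Illustrative only: two prompts, four samples each -> (0.75 + 0.25) / 2 = 0.5
print(pass_at_1([[True, False, True, True], [False, False, True, False]]))
```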
Regarding Q4, yes, I don’t think there’s any difference between VLM and LLM on-policy distillation except for the input. Have you tried directly changing the model name to a VLM and switching to a multimodal dataset?
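If it helps, here is a rough sketch of that idea (not NeMo-RL code; the batch keys and function names are placeholders): the student generates on image+text prompts, the teacher scores exactly the same sequences with the same multimodal inputs, and the distillation loss is applied only to the generated tokens.

```python
import torch

@torch.no_grad()
def teacher_logits_on_rollout(teacher, batch):
    # The teacher scores exactly the sequence the student generated; the only
    # VLM-specific addition is passing the image features along with the tokens.
    # (Some VLMs need extra processor outputs beyond pixel_values.)
    return teacher(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        pixel_values=batch["pixel_values"],
    ).logits

def vlm_onpolicy_kd_step(student, teacher, batch, kd_loss_fn):
    # Student forward pass (with grad) on its own rollout, same image-text inputs.
    student_logits = student(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        pixel_values=batch["pixel_values"],
    ).logits
    teacher_logits = teacher_logits_on_rollout(teacher, batch)
    # Same text-only KD loss (e.g., reverse KL), masked to the generated response.
    return kd_loss_fn(student_logits, teacher_logits, batch["response_mask"])
```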
Regarding Q5, my understanding is that as long as NeMo-RL supports this vLLM feature on the generation worker, on-policy distillation will naturally be supported as well. We could ask @terrykong about his plans.
Hello @zpqiu, thanks for the quick and kind response! The recipe you shared will be super helpful for my research :D
For Q4) I’m still new to NeMo-RL, so I haven’t tried on-policy KD with a VLM yet. I’ll carefully review the documentation and try it once I’m comfortable.
For Q5) Sounds great! To my understanding, once I upgrade vLLM on the generation worker to a version that supports the batch-invariance flag (maybe 0.11.1?), NeMo-RL will naturally support true on-policy KD. That’s really nice.
You’re welcome. If you run into any issues with VLM, feel free to open a new issue so we can track it.
Hello @zpqiu! I’m making progress on training on-policy KD for a VLM with your great recipe :D It seems like it can be implemented by directly inputting VLM features, as you said.
Even though it’s a bit off-topic, can I use the async-rollout feature for on-policy KD?
As for GRPO: to the best of my understanding, async GRPO is supported on the latest main branch, and I believe the situation should not be very different for on-policy KD.
Could you confirm whether there’s anything I missed? I hope it can be supported easily. Thanks, as always, for your great help!
Best, Jihwan
Sorry for the delay @JihwanEom .
> can I use the async-rollout feature for on-policy KD?
Sure, I think on-policy KD also supports async rollout. Have you tried it?