**Is your feature request related to a problem? Please describe.** Currently async_save is [disabled](https://github.com/NVIDIA-NeMo/RL/blame/8762f575c0d11aeb8a073a64e49cf433eb77c94a/nemo_rl/models/policy/megatron_policy_worker.py#L654) in the mcore checkpointing path, so serialization takes a long time while training is paused; we should test async_save and...
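For reference, a minimal sketch of the idea behind async_save: serialize on a background thread so the training loop is not stalled for the full save. This is not Megatron-Core's dist-checkpointing API; the function name and arguments below are hypothetical and purely illustrative.

```python
import threading
import torch

def save_checkpoint_async(state_dict: dict, path: str) -> threading.Thread:
    """Kick off serialization in the background and return the thread handle."""
    # Copy tensors to CPU up front so training can keep mutating GPU weights.
    cpu_state = {k: v.detach().cpu().clone() if torch.is_tensor(v) else v
                 for k, v in state_dict.items()}

    def _worker() -> None:
        torch.save(cpu_state, path)  # the slow, blocking serialization

    t = threading.Thread(target=_worker, daemon=True)
    t.start()
    return t  # caller should join() before the next save or at shutdown
```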
When the number of nodes is >= 32 and `policy.train` is called, some ranks take a long time to perform the initial synchronization (the AllReduce NCCL kernel), while other ranks...
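A sketch of how per-rank timing around that first collective could be captured to see which ranks arrive late; this assumes `torch.distributed` is already initialized and the tensor lives on GPU, and the function name is illustrative.

```python
import time
import torch
import torch.distributed as dist

def timed_first_allreduce(tensor: torch.Tensor) -> None:
    rank = dist.get_rank()
    t0 = time.perf_counter()
    dist.all_reduce(tensor)      # the initial synchronization point
    torch.cuda.synchronize()     # wait for the NCCL kernel to finish
    elapsed = time.perf_counter() - t0
    # Ranks that arrive late see a short wait; ranks that arrive early see a
    # long wait, which is the asymmetry the issue describes.
    print(f"[rank {rank}] first all_reduce took {elapsed:.2f}s")
```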
Tracking items to improve performance for long max-seqlen workloads, including training and generation performance at long context.
**Is your feature request related to a problem? Please describe.** Right now nemo-rl logs the mean and max generated tokens per sample every step, but those two metrics cannot fully...
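One hedged illustration of richer per-step statistics (percentiles alongside mean/max), assuming `gen_lengths` holds the generated token counts for one step; the metric keys are hypothetical, not existing nemo-rl logger names.

```python
import numpy as np

def length_stats(gen_lengths: list[int]) -> dict[str, float]:
    arr = np.asarray(gen_lengths)
    return {
        "gen_tokens/mean": float(arr.mean()),
        "gen_tokens/max": float(arr.max()),
        "gen_tokens/p50": float(np.percentile(arr, 50)),
        "gen_tokens/p90": float(np.percentile(arr, 90)),
        "gen_tokens/p99": float(np.percentile(arr, 99)),
    }
```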
**Is your feature request related to a problem? Please describe.** Right now in nemo-rl GRPO, the generation workers return the token_ids to the head node on CPU, and the head node...
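A sketch of the data flow described above, with hypothetical function names: the worker does a device-to-host copy before returning, and the head node copies the tokens back onto a device afterwards.

```python
import torch

def worker_return_tokens(token_ids: torch.Tensor) -> torch.Tensor:
    # Device-to-host copy on the generation worker; this is the transfer
    # the issue calls out.
    return token_ids.cpu()

def head_node_collect(per_worker_tokens: list[torch.Tensor],
                      device: str = "cuda") -> torch.Tensor:
    # Host-to-device copy again on the head node before further processing.
    return torch.cat(per_worker_tokens, dim=0).to(device, non_blocking=True)
```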
Issue to track low-precision GRPO recipe testing and perf optimization
**Is your feature request related to a problem? Please describe.** Release code for FP8 rollout + FP8 training (blockwise scaling)
* Convergence test and downstream evaluation compared w/ BF16 for...
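As a conceptual illustration only, a toy sketch of blockwise scaling over 1D blocks of 128 values: each block gets its own scale so its dynamic range fits the FP8 format. Real FP8 recipes (e.g. in Transformer Engine) differ, and this assumes a PyTorch build with float8 dtypes.

```python
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude for e4m3

def blockwise_quantize(x: torch.Tensor, block: int = 128):
    """Return values cast to float8_e4m3fn plus one scale per block."""
    flat = x.flatten()
    pad = (-flat.numel()) % block
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block)
    # Per-block scale chosen so the block's max magnitude maps to FP8 max.
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_E4M3_MAX
    q = (blocks / scales).to(torch.float8_e4m3fn)
    return q, scales
```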
# What does this PR do?
Random dataset following the specified input and output sequence lengths.
# Issues
closes #1302
# Usage
Use the following flags for fixed ISL/OSL eval...
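A self-contained sketch of such a random fixed-ISL/OSL dataset; the class and argument names here are hypothetical illustrations, not the PR's actual flags.

```python
import torch
from torch.utils.data import Dataset

class RandomSeqLenDataset(Dataset):
    """Random token ids with a fixed input length and a target output length."""

    def __init__(self, num_samples: int, input_seq_len: int, output_seq_len: int,
                 vocab_size: int = 32000, seed: int = 0):
        self.num_samples = num_samples
        self.isl, self.osl = input_seq_len, output_seq_len
        self.vocab_size = vocab_size
        self.generator = torch.Generator().manual_seed(seed)

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int) -> dict:
        return {
            "input_ids": torch.randint(0, self.vocab_size, (self.isl,),
                                       generator=self.generator),
            "output_len": self.osl,  # generation is forced to exactly OSL tokens
        }
```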
**Is your feature request related to a problem? Please describe.** Change the performance test script to use DAPO algo and Math17k dataset.
Tracking v0.5 items for MoE performance; the example model is DeepSeek-V3.