70B GRPO training is slower than reported

Open · butsugiri opened this issue 4 months ago · 4 comments

Describe the bug

When training GRPO with Llama 3.1 70B, we observe moderately slower performance compared to what is reported in the NVIDIA NeMo blog post.

This slowdown occurs with both the DTensor backend and the Megatron backend.

| Model | Backend | Nodes | GPUs per node | Total step time (s) | Policy training (s) | Refit (s) | Generation (s) | Get logprobs (s) | Avg. generated tokens per sample |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1-8B Instruct | Megatron (blog) | 1 | 8 | 112 | 28 | 5 | 58 | 18 | 795 |
| | Megatron (our rerun) | 1 | 8 | 116 | 27 | 8 | 56 | 18 | 811 |
| | PyT DTensor (blog) | 1 | 8 | 122 | 38 | 4 | 57 | 19 | 777 |
| | PyT DTensor (our rerun) | 1 | 8 | 129 | 40 | 6 | 55 | 19 | 802 |
| Llama 3.1-70B Base | Megatron (blog) | 8 | 8 | 147 | 28 | 14 | 84 | 18 | 398 |
| | Megatron (our rerun) | 8 | 8 | 178 | 31 | 26 | 87 | 20 | 396 |
| | PyT DTensor (blog) | 8 | 8 | 230 | 97 | 15 | 82 | 28 | 395 |
| | PyT DTensor (our rerun) | 8 | 8 | 266 | 117 | 20 | 88 | 29 | 411 |
  • 8B GRPO: almost as fast as reported in the blog post.
  • 70B GRPO: noticeably slower for both the DTensor and Megatron backends. In particular,
    • refit is ~2x slower with the Megatron backend (14 s → 26 s) and ~1.3x slower with DTensor (15 s → 20 s)
    • policy training is also slower with DTensor (97 s → 117 s)

Do you have any suggestions or insights into why the refit step is much slower in our setup?

Steps/Code to reproduce bug

We are using the configurations described in the blog post, which correspond to the following commands:

## dtensor
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml \
    policy.model_name=meta-llama/Llama-3.1-70B policy.tokenizer.name=meta-llama/Llama-3.1-70B-Instruct \
    policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=4096 \
    cluster.num_nodes=8 policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 \
    policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=False \
    loss_fn.use_importance_sampling_correction=True \
    policy.sequence_packing.enabled=False \
    policy.dynamic_batching.enabled=True  # as reported in the blog, we use dynamic batching instead of sequence packing

## megatron
uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_70B_megatron.yaml \
    policy.model_name=meta-llama/Llama-3.1-70B policy.tokenizer.name=meta-llama/Llama-3.3-70B-Instruct \
    policy.sequence_packing.enabled=True loss_fn.use_importance_sampling_correction=True

Expected behavior

GRPO training should be as fast as reported in the blog post.

Environment overview (please complete the following information)

  • Environment location: In-house Slurm cluster with H100 GPUs
  • Method of install: Internal Dockerfile to make NeMo-RL run on our cluster. The NeMo-RL codebase is strictly based on the v0.3.1 release.

Environment details

  • OS version: Dockerfile is based on nvcr.io/nvidia/cuda-dl-base:25.05-cuda12.9-devel-ubuntu24.04
  • PyTorch version: 2.7.0
  • Python version: 3.12

Additional context

GPU is NVIDIA H100 80GB

butsugiri commented on Aug 21, 2025

Hi, I just rebuilt the Dockerfile on the v0.3.1 branch and reran the Megatron 70B experiment, and my results matched what is reported in the blog. Could you share more details on the differences between your Dockerfile and the one in v0.3.1? Also, which steps did you measure to gather the results in the table? Note that the numbers reported in the blog are averages over steps 22–29.

ashors1 commented on Aug 27, 2025

Hi, thank you for your support.

Dockerfile

Please find our Dockerfile below. The main differences are:

  • Installing libibverbs-dev via apt, which is required to enable the InfiniBand interconnect on our cluster
  • Removing Nsight, which caused a compile error when we attempted to build the image
  • Running uv add to incorporate in-house dependencies
    • We make sure this does not override core libraries such as PyTorch, vLLM, Flash-Attention, etc.
  • Explicitly cloning NeMo-RL v0.3.1 to avoid conflicts with our internal modifications in nemo_rl_extensions
    • Vanilla NeMo-RL lives in /opt/nemo-rl, whereas our extensions live in /code

ARG BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:25.05-cuda12.9-devel-ubuntu24.04
FROM ${BASE_IMAGE} AS base

# For Japanese character
ENV LESSCHARSET=utf-8

USER root

RUN <<"EOF" bash -exu -o pipefail
export DEBIAN_FRONTEND=noninteractive
export TZ=Asia/Tokyo

apt-get update
apt-get install -y --no-install-recommends \
    jq \
    curl \
    git \
    rsync \
    wget \
    less \
    vim \
    libibverbs-dev  # for infiniband

apt-get clean
rm -rf /var/lib/apt/lists/*
EOF

ARG UV_VERSION=0.7.2
ARG PYTHON_VERSION=3.12
ENV PATH="/root/.local/bin:$PATH"
RUN curl -LsSf https://astral.sh/uv/${UV_VERSION}/install.sh | sh && \
    uv python install ${PYTHON_VERSION}

ENV RAY_USAGE_STATS_ENABLED=0
ENV NEMO_RL_VENV_DIR=/opt/ray_venvs

FROM base AS hermetic

WORKDIR /opt/nemo-rl

ARG NEMO_RL_REPO=https://github.com/NVIDIA/NeMo-RL.git
ARG NEMO_RL_BRANCH_OR_COMMIT=v0.3.1
RUN git clone ${NEMO_RL_REPO} . && \
    git checkout ${NEMO_RL_BRANCH_OR_COMMIT} && \
    git submodule update --init --recursive

ENV UV_PROJECT_ENVIRONMENT=/opt/nemo_rl_venv
ENV UV_LINK_MODE=copy

COPY requirements.nemo_rl.txt /tmp

RUN <<"EOF" bash -exu
uv venv ${UV_PROJECT_ENVIRONMENT}

VIRTUAL_ENV=$UV_PROJECT_ENVIRONMENT uv pip install --link-mode symlink setuptools torch==2.7.0 psutil ninja --torch-backend=cu128
VIRTUAL_ENV=$UV_PROJECT_ENVIRONMENT MAX_JOBS=16 uv pip install -v --link-mode symlink flash-attn==2.7.4.post1 --no-build-isolation
EOF

RUN <<"EOF" bash -exu
# uv sync has a more reliable resolver than simple uv pip install which can fail

# Sync each training + inference backend one at a time (since they may conflict)
# to warm the uv cache, then at the end just sync the default dependencies.
# Do everything in one layer to prevent large layers.

# The venv is symlinked to avoid bloating the layer size
uv add -v --index 'internal-pypi=https://www.example.com' -r /tmp/requirements.nemo_rl.txt
uv sync -v --link-mode symlink --locked --no-install-project
uv sync -v --link-mode symlink --locked --extra vllm --no-install-project
uv sync -v --link-mode symlink --locked --extra mcore --no-install-project
uv sync -v --link-mode symlink --locked --all-groups --no-install-project
EOF

ENV PATH="/opt/nemo_rl_venv/bin:$PATH"
ENV NEMO_RL_BRANCH_OR_COMMIT=${NEMO_RL_BRANCH_OR_COMMIT:-<unknown>}

COPY examples /code/examples
COPY nemo_rl_extensions /code/nemo_rl_extensions
RUN UV_LINK_MODE=symlink uv run nemo_rl/utils/prefetch_venvs.py
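
For reference, an illustrative build command (the image tag is arbitrary, both --build-arg values simply restate the defaults declared above, and the build context must contain requirements.nemo_rl.txt, examples/, and nemo_rl_extensions/):

# Illustrative build; the tag is arbitrary and the --build-arg values restate
# the defaults already declared in the Dockerfile.
docker build \
    --build-arg BASE_IMAGE=nvcr.io/nvidia/cuda-dl-base:25.05-cuda12.9-devel-ubuntu24.04 \
    --build-arg NEMO_RL_BRANCH_OR_COMMIT=v0.3.1 \
    -t nemo-rl:v0.3.1-custom .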

Step numbers

Regarding the step numbers, we collected timings for steps 22–29. As shown in the screenshots below, total step time ranged from 170–180 s in the Megatron setting and from 250–270 s in the DTensor setting.

[Screenshots of per-step timings for the Megatron and DTensor runs]
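
(A minimal sketch of how these step times can be averaged, assuming the per-step totals are exported to a CSV; the file name step_times.csv and the columns step,total_s are placeholders, not an actual NeMo-RL or W&B export format.)

# Placeholder CSV assumed: header row, then columns "step,total_s".
# Prints the mean total step time over steps 22-29.
awk -F, 'NR > 1 && $1 >= 22 && $1 <= 29 { sum += $2; n++ }
         END { if (n) printf "avg over %d steps: %.1f s\n", n, sum / n }' step_times.csv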

butsugiri commented on Aug 28, 2025

@butsugiri Hi, sorry for the delay. I built the container from the Dockerfile you provided and ran the test again, and I can still reproduce the results reported in the blog. Here is a screenshot of the Megatron-path run:

[Screenshot of the Megatron-path run]

Here is my script

#!/bin/bash

# git clone --recursive --branch v0.3.1 https://github.com/NVIDIA-NeMo/RL.git
# cd RL
# git checkout 27d0af945b96c5bf9fa928bf45fd1a6ce016632d && git submodule update --init --recursive


export HF_HOME=<your HF_HOME>
export WANDB_API_KEY=<your WANDB API KEY>
export EXP_SUFFIX="grpo_math_70b_v031_mcore_customer"
export WANDB_NAME=${EXP_SUFFIX}
export WANDB_PROJECT="nemo-rl-grpo-dev-guyueh"
export CHECKPOINT_DIR="results/${WANDB_NAME}"
export NUM_ACTOR_NODES=8
export RAY_DEDUP_LOGS=0
export BASE_LOG_DIR="logs/${WANDB_NAME}"
export CONTAINER=<container tag built from Dockerfile>
export MOUNTS="/lustre:/lustre:ro,${PWD}:/opt/nemo-rl"

export COMMAND="uv run ./examples/run_grpo_math.py \
    --config examples/configs/grpo_math_70B_megatron.yaml \
    policy.model_name=meta-llama/Llama-3.1-70B \
    policy.tokenizer.name=meta-llama/Llama-3.1-70B-Instruct \
    policy.sequence_packing.enabled=True \
    loss_fn.use_importance_sampling_correction=True \
    grpo.max_num_steps=50 \
    cluster.num_nodes=${NUM_ACTOR_NODES} \
    checkpointing.checkpoint_dir=${CHECKPOINT_DIR} \
    logger.wandb_enabled=True \
    logger.wandb.name=${WANDB_NAME} \
    logger.wandb.project=${WANDB_PROJECT}"

sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=coreai_dlalgo_nemorl \
    --job-name=${WANDB_NAME} \
    --partition=batch \
    --gres=gpu:8 \
    --time=04:00:00 \
    ray.sub

I did expect this result before I ran it, because your container is not significantly different from ours and you are not overriding any dependencies.

The most likely explanation for the gap is a difference in network performance. Your 70B refit and policy-training/logprob times are all somewhat slower than our benchmark, while generation time is similar; in this example, generation does not require cross-node communication, but training and refit do.
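
One quick way to test this hypothesis is to run a multi-node NCCL benchmark on the same Slurm allocation and compare the measured bus bandwidth with what your fabric should deliver. A minimal sketch, assuming an MPI-enabled build of nccl-tests (https://github.com/NVIDIA/nccl-tests) is available in the container at /opt/nccl-tests (that path is an assumption, not something the Dockerfile above installs):

# Hypothetical cross-node bandwidth check; assumes an MPI-enabled nccl-tests
# build at /opt/nccl-tests launched under Slurm. all_reduce_perf sweeps message
# sizes from 8 MB to 1 GB (doubling each step) with one GPU per task.
srun --nodes=8 --ntasks-per-node=8 --gpus-per-node=8 \
    /opt/nccl-tests/build/all_reduce_perf -b 8M -e 1G -f 2 -g 1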

guyueh1 commented on Nov 11, 2025

@guyueh1 Hi, thank you for reproducing our environment and looking into our issue. Regarding the suspected network-performance issue, we plan to

  • conduct an additional investigation with nsys on our side (see the sketch after this list), and
  • try the soon-to-be-released (I assume…) NeMo RL v0.4.0 and see whether training efficiency improves.
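
For reference, a generic Nsight Systems invocation along the lines we would start from. This is not a NeMo-RL-specific recipe: the output name and step count are arbitrary, and since the training actually runs inside Ray worker processes (which are not children of this driver command), nsys may need to be attached to those workers rather than the driver.

# Generic nsys sketch; output name and step count are placeholders. A
# driver-side profile like this mainly helps attribute wall-clock time,
# since the GPU work happens in separate Ray worker processes.
nsys profile --trace=cuda,nvtx,osrt --sample=none --force-overwrite=true \
    -o grpo_70b_profile \
    uv run ./examples/run_grpo_math.py \
        --config examples/configs/grpo_math_70B_megatron.yaml \
        grpo.max_num_steps=3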

butsugiri commented on Nov 20, 2025