When using GRPO + DeepSpeed ZeRO-3 + `ds3_gather_for_generation=False`, training gets stuck: the step counter is still 0 after an hour
Reproduction
```python
training_args = GRPOConfig(
    # use_vllm=True,  # use vLLM for fast inference!
    # vllm_mode="colocate",
    # vllm_tensor_parallel_size=8,
    # vllm_server_base_url="http://127.0.0.1:8000",
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    bf16=True,
    # bf16=is_bfloat16_supported(),
    # fp16=not is_bfloat16_supported(),
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # increase to 4 for smoother training
    num_generations=2,              # decrease if out of memory
    max_prompt_length=4096,
    max_completion_length=4096,
    # num_train_epochs=1,  # set to 1 for a full training run
    max_steps=250,
    save_steps=250,
    max_grad_norm=0.1,
    report_to="none",  # can use Weights & Biases
    output_dir=save_path,
    # deepspeed="./config_file/deepspeed/ds_z3_offload_config.json",
    auto_find_batch_size=False,
    ds3_gather_for_generation=False,
)
trainer = GRPOTrainer(
    model=model_path,
    # processing_class=tokenizer,
    reward_funcs=reward_json_and_answer,
    args=training_args,
    train_dataset=grpo_dataset,
)
trainer.train()
```
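For comparison, the commented-out lines above correspond to TRL's vLLM colocate path, in which generation runs inside vLLM workers on the training GPUs rather than through the model's `generate` (so `ds3_gather_for_generation` is not involved). A sketch of that variant (same `model_path`, `save_path`, reward function, and dataset assumed; this is an untested alternative, not a confirmed fix):

```python
# Sketch of the vLLM colocate variant of the config above (an assumption
# based on the commented-out lines, not a verified workaround).
training_args = GRPOConfig(
    use_vllm=True,                 # use vLLM for fast inference
    vllm_mode="colocate",         # share the 8 training GPUs with vLLM
    vllm_tensor_parallel_size=8,
    learning_rate=5e-6,
    bf16=True,
    per_device_train_batch_size=1,
    num_generations=2,
    max_prompt_length=4096,
    max_completion_length=4096,
    max_steps=250,
    output_dir=save_path,
)
```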
```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  activation_checkpointing:
    partition_activations: true
  bf16:
    enabled: true
    loss_scale: 0
    loss_scale_window: 1000
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --config_file deepspeed_zero3.yaml grpo_train.py
```
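To see where each rank is actually blocked while the step counter stays at 0, a stack dump from inside the training script can help (stuck steps with ZeRO-3 are often ranks waiting in a collective). A minimal stdlib-only sketch, not part of TRL, that could be dropped into `grpo_train.py`:

```python
# Hang-diagnosis helper (stdlib only; an assumption, not part of TRL):
# periodically print every thread's Python stack so a stuck rank shows
# which call it is blocked in.
import sys
import threading
import traceback


def dump_all_stacks() -> str:
    """Return the current Python stack of every live thread as one string."""
    frames = sys._current_frames()
    out = []
    for thread in threading.enumerate():
        frame = frames.get(thread.ident)
        if frame is None:
            continue
        out.append(f"--- {thread.name} ---\n")
        out.extend(traceback.format_stack(frame))
    return "".join(out)


def start_watchdog(interval_s: float = 600.0) -> threading.Timer:
    """Print all stacks to stderr every `interval_s` seconds until exit."""
    def tick() -> None:
        print(dump_all_stacks(), file=sys.stderr)
        start_watchdog(interval_s)  # re-arm the timer

    timer = threading.Timer(interval_s, tick)
    timer.daemon = True  # don't keep the process alive on shutdown
    timer.start()
    return timer
```

Calling `start_watchdog()` once before `trainer.train()` would print a stack snapshot every 10 minutes on each rank; `py-spy dump --pid <PID>` on a worker process is an external alternative.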
System Info
INFO 08-06 19:27:07 [__init__.py:235] Automatically detected platform cuda.
Copy-paste the following information when reporting an issue:
- Platform: Linux-5.10.134-18.al8.x86_64-x86_64-with-glibc2.32
- Python version: 3.12.0
- TRL version: 0.20.0
- PyTorch version: 2.7.1
- accelerator(s): NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB
- Transformers version: 4.53.3
- Accelerate version: 1.8.1
- Accelerate config: not found
- Datasets version: 3.6.0
- HF Hub version: 0.34.3
- bitsandbytes version: 0.46.1
- DeepSpeed version: 0.17.1
- Diffusers version: 0.34.0
- Liger-Kernel version: 0.5.10
- LLM-Blender version: not installed
- OpenAI version: 1.90.0
- PEFT version: 0.17.0
- vLLM version: 0.10.0
Checklist
- [x] I have checked that my issue isn't already filed (see open issues)
- [x] I have included my system information
- [x] Any code provided is minimal, complete, and reproducible (more on MREs)
- [x] Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
- [x] Any traceback provided is complete
After about 10 hours it proceeds normally, but GPU memory stays high (73 GB / 80 GB) with max_prompt_length = 4096 and max_completion_length = 4096.
I also got stuck. What tool did you end up using?