When using GRPO + DeepSpeed ZeRO-3 + `ds3_gather_for_generation=False`, training gets stuck: the step counter is still 0 after an hour
Reproduction
```python
training_args = GRPOConfig(
    # use_vllm=True,  # use vLLM for fast inference!
    # vllm_mode="colocate",
    # vllm_tensor_parallel_size=8,
    # vllm_server_base_url="http://127.0.0.1:8000",
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    bf16=True,
    # bf16=is_bfloat16_supported(),
    # fp16=not is_bfloat16_supported(),
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # increase to 4 for smoother training
    num_generations=2,              # decrease if out of memory
    max_prompt_length=4096,
    max_completion_length=4096,
    # num_train_epochs=1,  # set to 1 for a full training run
    max_steps=250,
    save_steps=250,
    max_grad_norm=0.1,
    report_to="none",  # can use Weights & Biases
    output_dir=save_path,
    # deepspeed="./config_file/deepspeed/ds_z3_offload_config.json",
    auto_find_batch_size=False,
    ds3_gather_for_generation=False,
)
trainer = GRPOTrainer(
    model=model_path,
    # processing_class=tokenizer,
    reward_funcs=reward_json_and_answer,
    args=training_args,
    train_dataset=grpo_dataset,
)
trainer.train()
```
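For comparison, the commented-out lines above correspond to TRL's vLLM colocate path, in which generation runs inside vLLM workers on the training GPUs rather than through the model's `generate` (so `ds3_gather_for_generation` is not involved). A sketch of that variant (same `model_path`, `save_path`, reward function, and dataset assumed; this is an untested alternative, not a confirmed fix):

```python
# Sketch of the vLLM colocate variant of the config above (an assumption
# based on the commented-out lines, not a verified workaround).
training_args = GRPOConfig(
    use_vllm=True,                 # use vLLM for fast inference
    vllm_mode="colocate",         # share the 8 training GPUs with vLLM
    vllm_tensor_parallel_size=8,
    learning_rate=5e-6,
    bf16=True,
    per_device_train_batch_size=1,
    num_generations=2,
    max_prompt_length=4096,
    max_completion_length=4096,
    max_steps=250,
    output_dir=save_path,
)
```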
```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  activation_checkpointing:
    partition_activations: true
  bf16:
    enabled: true
    loss_scale: 0
    loss_scale_window: 1000
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --config_file deepspeed_zero3.yaml grpo_train.py
```
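To see where each rank is actually blocked while the step counter stays at 0, a stack dump from inside the training script can help (stuck steps with ZeRO-3 are often ranks waiting in a collective). A minimal stdlib-only sketch, not part of TRL, that could be dropped into `grpo_train.py`:

```python
# Hang-diagnosis helper (stdlib only; an assumption, not part of TRL):
# periodically print every thread's Python stack so a stuck rank shows
# which call it is blocked in.
import sys
import threading
import traceback


def dump_all_stacks() -> str:
    """Return the current Python stack of every live thread as one string."""
    frames = sys._current_frames()
    out = []
    for thread in threading.enumerate():
        frame = frames.get(thread.ident)
        if frame is None:
            continue
        out.append(f"--- {thread.name} ---\n")
        out.extend(traceback.format_stack(frame))
    return "".join(out)


def start_watchdog(interval_s: float = 600.0) -> threading.Timer:
    """Print all stacks to stderr every `interval_s` seconds until exit."""
    def tick() -> None:
        print(dump_all_stacks(), file=sys.stderr)
        start_watchdog(interval_s)  # re-arm the timer

    timer = threading.Timer(interval_s, tick)
    timer.daemon = True  # don't keep the process alive on shutdown
    timer.start()
    return timer
```

Calling `start_watchdog()` once before `trainer.train()` would print a stack snapshot every 10 minutes on each rank; `py-spy dump --pid <PID>` on a worker process is an external alternative.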
System Info
INFO 08-06 19:27:07 [__init__.py:235] Automatically detected platform cuda.
Copy-paste the following information when reporting an issue:
- Platform: Linux-5.10.134-18.al8.x86_64-x86_64-with-glibc2.32
- Python version: 3.12.0
- TRL version: 0.20.0
- PyTorch version: 2.7.1
- accelerator(s): NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB
- Transformers version: 4.53.3
- Accelerate version: 1.8.1
- Accelerate config: not found
- Datasets version: 3.6.0
- HF Hub version: 0.34.3
- bitsandbytes version: 0.46.1
- DeepSpeed version: 0.17.1
- Diffusers version: 0.34.0
- Liger-Kernel version: 0.5.10
- LLM-Blender version: not installed
- OpenAI version: 1.90.0
- PEFT version: 0.17.0
- vLLM version: 0.10.0
Checklist
- [x] I have checked that my issue isn't already filed (see open issues)
- [x] I have included my system information
- [x] Any code provided is minimal, complete, and reproducible (more on MREs)
- [x] Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
- [x] Any traceback provided is complete
After about 10 hours it proceeds normally, but GPU memory stays high (73 GB / 80 GB) with max_prompt_length = 4096 and max_completion_length = 4096.
I also got stuck. What tool did you end up using?