
veRL-SGLang slower than expected (GH200)

Open EduardDurech opened this issue 8 months ago • 31 comments

Running veRL-SGLang on a GH200 (aarch64) cluster: I got the installation working, and both standalone SGLang and veRL-vLLM work fine, but something seems off with veRL-SGLang, as it is significantly slower (though it uses much less memory). I tested torch_memory_saver standalone and it works. If there is anything I should debug, I can test it on our end.
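For reference, a minimal standalone torch_memory_saver check of this kind (a sketch based on the library's documented region/pause/resume API, not necessarily the exact test run here; the tensor size is arbitrary, and depending on the version the allocator hook may need to be preloaded):

# Standalone torch_memory_saver sanity check (illustrative sketch).
import torch
import torch_memory_saver

saver = torch_memory_saver.torch_memory_saver

# Tensors created inside region() can have their physical GPU memory released later.
with saver.region():
    big = torch.full((1_000_000_000,), 1, dtype=torch.uint8, device="cuda")

saver.pause()   # physical memory released; shows up as a drop in nvidia-smi
saver.resume()  # physical memory re-acquired; tensor contents are not guaranteed
print("allocated GB:", torch.cuda.memory_allocated() / 1e9)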

Versions

flash_attn                        2.7.3
flash_attn_3                      3.0.0b1
flashinfer-python                 0.2.2.post1
sgl-kernel                        0.0.9.post2
sglang                            0.4.5.post3
torch_memory_saver                0.0.5
verl                              0.3.0.post1
vllm                              0.8.3

FA3, flashinfer, sgl-kernel, sglang, torch_memory_saver, verl, and vllm were built from source

veRL Trainer

The vLLM runs use the equivalent command with actor_rollout_ref.rollout.name=vllm

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$(pwd)/data/train.parquet \
    data.val_files=$(pwd)/data/test.parquet \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='grpo_GSM8k_qwen0.5_test' \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 \
    "$@"

Blue: SGLang, Pink: FA2-vLLM, Green: FA3-vLLM (I don't know whether veRL actually exploits FA3)

Image

More graphs

Image

Image

Image

Image

Log
++ pwd
++ pwd
+ python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo data.train_files=/workspace/verl/data/train.parquet data.val_files=/workspace/verl/data/test.parquet data.train_batch_size=1024 data.max_prompt_length=1024 data.max_response_length=1024 data.filter_overlong_prompts=True data.truncation=error actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct actor_rollout_ref.actor.optim.lr=1e-6 actor_rollout_ref.model.use_remove_padding=True actor_rollout_ref.actor.ppo_mini_batch_size=256 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40 actor_rollout_ref.actor.use_kl_loss=True actor_rollout_ref.actor.kl_loss_coef=0.001 actor_rollout_ref.actor.kl_loss_type=low_var_kl actor_rollout_ref.model.enable_gradient_checkpointing=True actor_rollout_ref.actor.fsdp_config.param_offload=False actor_rollout_ref.actor.fsdp_config.optimizer_offload=False actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 actor_rollout_ref.rollout.tensor_model_parallel_size=2 actor_rollout_ref.rollout.name=sglang actor_rollout_ref.rollout.gpu_memory_utilization=0.6 actor_rollout_ref.rollout.n=5 actor_rollout_ref.rollout.enforce_eager=False actor_rollout_ref.rollout.free_cache_engine=False actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 actor_rollout_ref.ref.fsdp_config.param_offload=True algorithm.kl_ctrl.kl_coef=0.001 trainer.critic_warmup=0 'trainer.logger=[console,wandb]' trainer.project_name=verl_grpo_example_gsm8k trainer.experiment_name=grpo_GSM8k_qwen0.5_test trainer.n_gpus_per_node=4 trainer.nnodes=1 trainer.save_freq=-1 trainer.test_freq=5 trainer.total_epochs=15
2025-04-23 01:44:20,852 INFO worker.py:1852 -- Started a local Ray instance.
(TaskRunner pid=16292) {'actor_rollout_ref': {'actor': {'checkpoint': {'contents': ['model',
(TaskRunner pid=16292)                                                              'hf_model',
(TaskRunner pid=16292)                                                              'optimizer',
(TaskRunner pid=16292)                                                              'extra']},
(TaskRunner pid=16292)                                  'clip_ratio': 0.2,
(TaskRunner pid=16292)                                  'entropy_coeff': 0.001,
(TaskRunner pid=16292)                                  'fsdp_config': {'fsdp_size': -1,
(TaskRunner pid=16292)                                                  'optimizer_offload': False,
(TaskRunner pid=16292)                                                  'param_offload': False,
(TaskRunner pid=16292)                                                  'wrap_policy': {'min_num_params': 0}},
(TaskRunner pid=16292)                                  'grad_clip': 1.0,
(TaskRunner pid=16292)                                  'kl_loss_coef': 0.001,
(TaskRunner pid=16292)                                  'kl_loss_type': 'low_var_kl',
(TaskRunner pid=16292)                                  'optim': {'lr': 1e-06,
(TaskRunner pid=16292)                                            'lr_warmup_steps': -1,
(TaskRunner pid=16292)                                            'lr_warmup_steps_ratio': 0.0,
(TaskRunner pid=16292)                                            'min_lr_ratio': None,
(TaskRunner pid=16292)                                            'total_training_steps': -1,
(TaskRunner pid=16292)                                            'warmup_style': 'constant'},
(TaskRunner pid=16292)                                  'ppo_epochs': 1,
(TaskRunner pid=16292)                                  'ppo_max_token_len_per_gpu': 16384,
(TaskRunner pid=16292)                                  'ppo_micro_batch_size': None,
(TaskRunner pid=16292)                                  'ppo_micro_batch_size_per_gpu': 40,
(TaskRunner pid=16292)                                  'ppo_mini_batch_size': 256,
(TaskRunner pid=16292)                                  'shuffle': False,
(TaskRunner pid=16292)                                  'strategy': 'fsdp',
(TaskRunner pid=16292)                                  'ulysses_sequence_parallel_size': 1,
(TaskRunner pid=16292)                                  'use_dynamic_bsz': False,
(TaskRunner pid=16292)                                  'use_kl_loss': True,
(TaskRunner pid=16292)                                  'use_torch_compile': True},
(TaskRunner pid=16292)                        'hybrid_engine': True,
(TaskRunner pid=16292)                        'model': {'enable_gradient_checkpointing': True,
(TaskRunner pid=16292)                                  'external_lib': None,
(TaskRunner pid=16292)                                  'override_config': {},
(TaskRunner pid=16292)                                  'path': 'Qwen/Qwen2.5-0.5B-Instruct',
(TaskRunner pid=16292)                                  'use_remove_padding': True},
(TaskRunner pid=16292)                        'ref': {'fsdp_config': {'param_offload': True,
(TaskRunner pid=16292)                                                'wrap_policy': {'min_num_params': 0}},
(TaskRunner pid=16292)                                'log_prob_max_token_len_per_gpu': 16384,
(TaskRunner pid=16292)                                'log_prob_micro_batch_size': None,
(TaskRunner pid=16292)                                'log_prob_micro_batch_size_per_gpu': 40,
(TaskRunner pid=16292)                                'log_prob_use_dynamic_bsz': False,
(TaskRunner pid=16292)                                'ulysses_sequence_parallel_size': 1},
(TaskRunner pid=16292)                        'rollout': {'disable_log_stats': True,
(TaskRunner pid=16292)                                    'do_sample': True,
(TaskRunner pid=16292)                                    'dtype': 'bfloat16',
(TaskRunner pid=16292)                                    'enable_chunked_prefill': True,
(TaskRunner pid=16292)                                    'enforce_eager': False,
(TaskRunner pid=16292)                                    'free_cache_engine': False,
(TaskRunner pid=16292)                                    'gpu_memory_utilization': 0.6,
(TaskRunner pid=16292)                                    'ignore_eos': False,
(TaskRunner pid=16292)                                    'load_format': 'dummy_dtensor',
(TaskRunner pid=16292)                                    'log_prob_max_token_len_per_gpu': 16384,
(TaskRunner pid=16292)                                    'log_prob_micro_batch_size': None,
(TaskRunner pid=16292)                                    'log_prob_micro_batch_size_per_gpu': 40,
(TaskRunner pid=16292)                                    'log_prob_use_dynamic_bsz': False,
(TaskRunner pid=16292)                                    'max_model_len': None,
(TaskRunner pid=16292)                                    'max_num_batched_tokens': 8192,
(TaskRunner pid=16292)                                    'max_num_seqs': 1024,
(TaskRunner pid=16292)                                    'n': 5,
(TaskRunner pid=16292)                                    'name': 'sglang',
(TaskRunner pid=16292)                                    'prompt_length': 1024,
(TaskRunner pid=16292)                                    'response_length': 1024,
(TaskRunner pid=16292)                                    'temperature': 1.0,
(TaskRunner pid=16292)                                    'tensor_model_parallel_size': 2,
(TaskRunner pid=16292)                                    'top_k': -1,
(TaskRunner pid=16292)                                    'top_p': 1,
(TaskRunner pid=16292)                                    'use_fire_sampling': False,
(TaskRunner pid=16292)                                    'val_kwargs': {'do_sample': False,
(TaskRunner pid=16292)                                                   'n': 1,
(TaskRunner pid=16292)                                                   'temperature': 0,
(TaskRunner pid=16292)                                                   'top_k': -1,
(TaskRunner pid=16292)                                                   'top_p': 1.0}}},
(TaskRunner pid=16292)  'algorithm': {'adv_estimator': 'grpo',
(TaskRunner pid=16292)                'gamma': 1.0,
(TaskRunner pid=16292)                'kl_ctrl': {'kl_coef': 0.001, 'type': 'fixed'},
(TaskRunner pid=16292)                'kl_penalty': 'kl',
(TaskRunner pid=16292)                'lam': 1.0},
(TaskRunner pid=16292)  'critic': {'checkpoint': {'contents': ['model',
(TaskRunner pid=16292)                                         'hf_model',
(TaskRunner pid=16292)                                         'optimizer',
(TaskRunner pid=16292)                                         'extra']},
(TaskRunner pid=16292)             'cliprange_value': 0.5,
(TaskRunner pid=16292)             'forward_max_token_len_per_gpu': 32768,
(TaskRunner pid=16292)             'forward_micro_batch_size': None,
(TaskRunner pid=16292)             'forward_micro_batch_size_per_gpu': None,
(TaskRunner pid=16292)             'grad_clip': 1.0,
(TaskRunner pid=16292)             'model': {'enable_gradient_checkpointing': True,
(TaskRunner pid=16292)                       'external_lib': None,
(TaskRunner pid=16292)                       'fsdp_config': {'fsdp_size': -1,
(TaskRunner pid=16292)                                       'optimizer_offload': False,
(TaskRunner pid=16292)                                       'param_offload': False,
(TaskRunner pid=16292)                                       'wrap_policy': {'min_num_params': 0}},
(TaskRunner pid=16292)                       'override_config': {},
(TaskRunner pid=16292)                       'path': '~/models/deepseek-llm-7b-chat',
(TaskRunner pid=16292)                       'tokenizer_path': 'Qwen/Qwen2.5-0.5B-Instruct',
(TaskRunner pid=16292)                       'use_remove_padding': False},
(TaskRunner pid=16292)             'optim': {'lr': 1e-05,
(TaskRunner pid=16292)                       'lr_warmup_steps_ratio': 0.0,
(TaskRunner pid=16292)                       'min_lr_ratio': None,
(TaskRunner pid=16292)                       'total_training_steps': -1,
(TaskRunner pid=16292)                       'warmup_style': 'constant'},
(TaskRunner pid=16292)             'ppo_epochs': 1,
(TaskRunner pid=16292)             'ppo_max_token_len_per_gpu': 32768,
(TaskRunner pid=16292)             'ppo_micro_batch_size': None,
(TaskRunner pid=16292)             'ppo_micro_batch_size_per_gpu': None,
(TaskRunner pid=16292)             'ppo_mini_batch_size': 256,
(TaskRunner pid=16292)             'shuffle': False,
(TaskRunner pid=16292)             'strategy': 'fsdp',
(TaskRunner pid=16292)             'ulysses_sequence_parallel_size': 1,
(TaskRunner pid=16292)             'use_dynamic_bsz': False},
(TaskRunner pid=16292)  'custom_reward_function': {'name': 'compute_score', 'path': None},
(TaskRunner pid=16292)  'data': {'filter_overlong_prompts': True,
(TaskRunner pid=16292)           'image_key': 'images',
(TaskRunner pid=16292)           'max_prompt_length': 1024,
(TaskRunner pid=16292)           'max_response_length': 1024,
(TaskRunner pid=16292)           'prompt_key': 'prompt',
(TaskRunner pid=16292)           'return_raw_chat': False,
(TaskRunner pid=16292)           'return_raw_input_ids': False,
(TaskRunner pid=16292)           'shuffle': True,
(TaskRunner pid=16292)           'tokenizer': None,
(TaskRunner pid=16292)           'train_batch_size': 1024,
(TaskRunner pid=16292)           'train_files': '/workspace/verl/data/train.parquet',
(TaskRunner pid=16292)           'truncation': 'error',
(TaskRunner pid=16292)           'val_batch_size': None,
(TaskRunner pid=16292)           'val_files': '/workspace/verl/data/test.parquet'},
(TaskRunner pid=16292)  'reward_model': {'enable': False,
(TaskRunner pid=16292)                   'forward_max_token_len_per_gpu': 32768,
(TaskRunner pid=16292)                   'max_length': None,
(TaskRunner pid=16292)                   'micro_batch_size': None,
(TaskRunner pid=16292)                   'micro_batch_size_per_gpu': None,
(TaskRunner pid=16292)                   'model': {'external_lib': None,
(TaskRunner pid=16292)                             'fsdp_config': {'fsdp_size': -1,
(TaskRunner pid=16292)                                             'param_offload': False,
(TaskRunner pid=16292)                                             'wrap_policy': {'min_num_params': 0}},
(TaskRunner pid=16292)                             'input_tokenizer': 'Qwen/Qwen2.5-0.5B-Instruct',
(TaskRunner pid=16292)                             'path': '~/models/FsfairX-LLaMA3-RM-v0.1',
(TaskRunner pid=16292)                             'use_remove_padding': False},
(TaskRunner pid=16292)                   'reward_manager': 'naive',
(TaskRunner pid=16292)                   'strategy': 'fsdp',
(TaskRunner pid=16292)                   'ulysses_sequence_parallel_size': 1,
(TaskRunner pid=16292)                   'use_dynamic_bsz': False},
(TaskRunner pid=16292)  'trainer': {'balance_batch': True,
(TaskRunner pid=16292)              'critic_warmup': 0,
(TaskRunner pid=16292)              'default_hdfs_dir': None,
(TaskRunner pid=16292)              'default_local_dir': 'checkpoints/verl_grpo_example_gsm8k/grpo_GSM8k_qwen0.5_test',
(TaskRunner pid=16292)              'del_local_ckpt_after_load': False,
(TaskRunner pid=16292)              'experiment_name': 'grpo_GSM8k_qwen0.5_test',
(TaskRunner pid=16292)              'logger': ['console', 'wandb'],
(TaskRunner pid=16292)              'max_actor_ckpt_to_keep': None,
(TaskRunner pid=16292)              'max_critic_ckpt_to_keep': None,
(TaskRunner pid=16292)              'n_gpus_per_node': 4,
(TaskRunner pid=16292)              'nnodes': 1,
(TaskRunner pid=16292)              'project_name': 'verl_grpo_example_gsm8k',
(TaskRunner pid=16292)              'resume_from_path': None,
(TaskRunner pid=16292)              'resume_mode': 'auto',
(TaskRunner pid=16292)              'save_freq': -1,
(TaskRunner pid=16292)              'test_freq': 5,
(TaskRunner pid=16292)              'total_epochs': 15,
(TaskRunner pid=16292)              'total_training_steps': None,
(TaskRunner pid=16292)              'val_generations_to_log_to_wandb': 0}}
(TaskRunner pid=16292) [validate_config] All configuration checks passed successfully!
(TaskRunner pid=16292) dataset len: 7473
(TaskRunner pid=16292) filter dataset len: 7473
(TaskRunner pid=16292) dataset len: 1319
(TaskRunner pid=16292) DeprecationWarning: `ray.state.available_resources_per_node` is a private attribute and access will be removed in a future Ray version.
(TaskRunner pid=16292) filter dataset len: 1319
(TaskRunner pid=16292) Size of train dataloader: 7
(TaskRunner pid=16292) Total training steps: 105
(WorkerDict pid=17885) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
(WorkerDict pid=17885) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention
(WorkerDict pid=17885) [rank3]:[W423 01:45:07.460585879 ProcessGroupNCCL.cpp:4571] [PG ID 0 PG GUID 0 Rank 3]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
(WorkerDict pid=17565) Model config after override: Qwen2Config {
(WorkerDict pid=17565)   "architectures": [
(WorkerDict pid=17565)     "Qwen2ForCausalLM"
(WorkerDict pid=17565)   ],
(WorkerDict pid=17565)   "attention_dropout": 0.0,
(WorkerDict pid=17565)   "eos_token_id": 151645,
(WorkerDict pid=17565)   "hidden_act": "silu",
(WorkerDict pid=17565)   "hidden_size": 896,
(WorkerDict pid=17565)   "initializer_range": 0.02,
(WorkerDict pid=17565)   "intermediate_size": 4864,
(WorkerDict pid=17565)   "max_position_embeddings": 32768,
(WorkerDict pid=17565)   "max_window_layers": 21,
(WorkerDict pid=17565)   "model_type": "qwen2",
(WorkerDict pid=17565)   "num_attention_heads": 14,
(WorkerDict pid=17565)   "num_hidden_layers": 24,
(WorkerDict pid=17565)   "num_key_value_heads": 2,
(WorkerDict pid=17565)   "pad_token_id": 151643,
(WorkerDict pid=17565)   "rms_norm_eps": 1e-06,
(WorkerDict pid=17565)   "rope_scaling": null,
(WorkerDict pid=17565)   "rope_theta": 1000000.0,
(WorkerDict pid=17565)   "sliding_window": 32768,
(WorkerDict pid=17565)   "tie_word_embeddings": true,
(WorkerDict pid=17565)   "torch_dtype": "bfloat16",
(WorkerDict pid=17565)   "transformers_version": "4.51.0",
(WorkerDict pid=17565)   "use_cache": true,
(WorkerDict pid=17565)   "use_sliding_window": false,
(WorkerDict pid=17565)   "vocab_size": 151936
(WorkerDict pid=17565) }
(WorkerDict pid=17565) 
(WorkerDict pid=17565) Qwen2ForCausalLM contains 494.03M parameters
(WorkerDict pid=17565) wrap_policy: functools.partial(<function _or_policy at 0x40305686b7e0>, policies=[functools.partial(<function transformer_auto_wrap_policy at 0x40305686b6a0>, transformer_layer_cls={<class 'transformers.models.qwen2.modeling_qwen2.Qwen2DecoderLayer'>})])
(WorkerDict pid=17884) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(WorkerDict pid=17565) Actor use_remove_padding=True
(WorkerDict pid=17565) Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
(WorkerDict pid=17884) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [repeated 3x across cluster]
(WorkerDict pid=17884) [rank2]:[W423 01:45:09.109160841 ProcessGroupNCCL.cpp:4571] [PG ID 0 PG GUID 0 Rank 2]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. [repeated 3x across cluster]
(WorkerDict pid=17565) Model config after override: Qwen2Config {
(WorkerDict pid=17565)   "architectures": [
(WorkerDict pid=17565)     "Qwen2ForCausalLM"
(WorkerDict pid=17565)   ],
(WorkerDict pid=17565)   "attention_dropout": 0.0,
(WorkerDict pid=17565)   "eos_token_id": 151645,
(WorkerDict pid=17565)   "hidden_act": "silu",
(WorkerDict pid=17565)   "hidden_size": 896,
(WorkerDict pid=17565)   "initializer_range": 0.02,
(WorkerDict pid=17565)   "intermediate_size": 4864,
(WorkerDict pid=17565)   "max_position_embeddings": 32768,
(WorkerDict pid=17565)   "max_window_layers": 21,
(WorkerDict pid=17565)   "model_type": "qwen2",
(WorkerDict pid=17565)   "num_attention_heads": 14,
(WorkerDict pid=17565)   "num_hidden_layers": 24,
(WorkerDict pid=17565)   "num_key_value_heads": 2,
(WorkerDict pid=17565)   "pad_token_id": 151643,
(WorkerDict pid=17565)   "rms_norm_eps": 1e-06,
(WorkerDict pid=17565)   "rope_scaling": null,
(WorkerDict pid=17565)   "rope_theta": 1000000.0,
(WorkerDict pid=17565)   "sliding_window": 32768,
(WorkerDict pid=17565)   "tie_word_embeddings": true,
(WorkerDict pid=17565)   "torch_dtype": "bfloat16",
(WorkerDict pid=17565)   "transformers_version": "4.51.0",
(WorkerDict pid=17565)   "use_cache": true,
(WorkerDict pid=17565)   "use_sliding_window": false,
(WorkerDict pid=17565)   "vocab_size": 151936
(WorkerDict pid=17565) }
(WorkerDict pid=17565) 
(WorkerDict pid=17565) Qwen2ForCausalLM contains 494.03M parameters
(WorkerDict pid=17565) Total steps: 105, num_warmup_steps: 0
(WorkerDict pid=17885) wrap_policy: functools.partial(<function _or_policy at 0x4030615cb7e0>, policies=[functools.partial(<function transformer_auto_wrap_policy at 0x4030615cb6a0>, transformer_layer_cls={<class 'transformers.models.qwen2.modeling_qwen2.Qwen2DecoderLayer'>})]) [repeated 7x across cluster]
(WorkerDict pid=17883) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention [repeated 4x across cluster]
(WorkerDict pid=17565) Before building sglang rollout, memory allocated (GB): 0.46010828018188477, memory reserved (GB): 2.166015625
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(WorkerDict pid=17883) Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` [repeated 3x across cluster]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.02it/s]
(WorkerDict pid=17565) 
Capturing batches (avail_mem=34.33 GB):   0%|          | 0/35 [00:00<?, ?it/s]
Capturing batches (avail_mem=33.84 GB):   3%|▎         | 1/35 [00:01<00:47,  1.39s/it]
Capturing batches (avail_mem=33.66 GB):   6%|▌         | 2/35 [00:01<00:30,  1.10it/s]
Capturing batches (avail_mem=33.49 GB):   9%|▊         | 3/35 [00:02<00:25,  1.24it/s]
Capturing batches (avail_mem=33.33 GB):  11%|█▏        | 4/35 [00:03<00:21,  1.43it/s]
Capturing batches (avail_mem=33.17 GB):  14%|█▍        | 5/35 [00:04<00:23,  1.27it/s]
Capturing batches (avail_mem=33.01 GB):  17%|█▋        | 6/35 [00:04<00:20,  1.44it/s]
Capturing batches (avail_mem=32.86 GB):  20%|██        | 7/35 [00:05<00:20,  1.34it/s]
Capturing batches (avail_mem=32.72 GB):  23%|██▎       | 8/35 [00:06<00:21,  1.26it/s]
Capturing batches (avail_mem=32.58 GB):  26%|██▌       | 9/35 [00:07<00:21,  1.22it/s]
Capturing batches (avail_mem=32.45 GB):  29%|██▊       | 10/35 [00:07<00:19,  1.28it/s]
Capturing batches (avail_mem=32.32 GB):  31%|███▏      | 11/35 [00:08<00:16,  1.44it/s]
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.60it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  6.58it/s]
(WorkerDict pid=17884) 
Capturing batches (avail_mem=32.20 GB):  34%|███▍      | 12/35 [00:08<00:14,  1.57it/s]
Capturing batches (avail_mem=32.08 GB):  37%|███▋      | 13/35 [00:09<00:14,  1.52it/s]
Capturing batches (avail_mem=33.99 GB):   0%|          | 0/35 [00:00<?, ?it/s]
Capturing batches (avail_mem=31.97 GB):  40%|████      | 14/35 [00:10<00:12,  1.63it/s]
Capturing batches (avail_mem=31.49 GB):  57%|█████▋    | 20/35 [00:13<00:07,  1.99it/s] [repeated 11x across cluster]
Capturing batches (avail_mem=31.11 GB):  86%|████████▌ | 30/35 [00:18<00:02,  2.01it/s] [repeated 21x across cluster]
Capturing batches (avail_mem=31.07 GB):  91%|█████████▏| 32/35 [00:19<00:01,  2.05it/s]
Capturing batches (avail_mem=31.06 GB):  94%|█████████▍| 33/35 [00:19<00:00,  2.07it/s]
Capturing batches (avail_mem=31.05 GB):  97%|█████████▋| 34/35 [00:20<00:00,  2.08it/s]
Capturing batches (avail_mem=31.05 GB): 100%|██████████| 35/35 [00:20<00:00,  1.70it/s]
(WorkerDict pid=17883) kwargs: {'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
(WorkerDict pid=17885) Actor use_remove_padding=True [repeated 7x across cluster]
(WorkerDict pid=17885) Total steps: 105, num_warmup_steps: 0 [repeated 3x across cluster]
(WorkerDict pid=17565) /usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
(WorkerDict pid=17565)   warnings.warn(
(WorkerDict pid=17565) After building sglang rollout, memory allocated (GB): 0.46010828018188477, memory reserved (GB): 2.166015625
(WorkerDict pid=17565) After building sharding manager, memory allocated (GB): 0.46010828018188477, memory reserved (GB): 2.166015625
Capturing batches (avail_mem=30.90 GB):  77%|███████▋  | 27/35 [00:13<00:03,  2.02it/s] [repeated 12x across cluster]
Capturing batches (avail_mem=30.76 GB):  91%|█████████▏| 32/35 [00:16<00:01,  2.03it/s]
Capturing batches (avail_mem=30.76 GB):  94%|█████████▍| 33/35 [00:16<00:00,  2.03it/s]
Capturing batches (avail_mem=30.75 GB):  97%|█████████▋| 34/35 [00:17<00:00,  2.03it/s]
(WorkerDict pid=17883) /usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
(WorkerDict pid=17883)   warnings.warn(
Capturing batches (avail_mem=30.75 GB): 100%|██████████| 35/35 [00:17<00:00,  1.96it/s]
(WorkerDict pid=17885) /usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
(WorkerDict pid=17885)   warnings.warn(
Capturing batches (avail_mem=30.77 GB):  89%|████████▊ | 31/35 [00:15<00:01,  2.02it/s] [repeated 4x across cluster]
(WorkerDict pid=17885) kwargs: {'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 2x across cluster]
(TaskRunner pid=16292) wandb: Currently logged in as: <user> to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
(TaskRunner pid=16292) wandb: Tracking run with wandb version 0.19.10
(TaskRunner pid=16292) wandb: Run data is saved locally in /workspace/verl/wandb/run-20250423_014617-n98y2dl8
(TaskRunner pid=16292) wandb: Run `wandb offline` to turn off syncing.
(TaskRunner pid=16292) wandb: Syncing run grpo_GSM8k_qwen0.5_test
(TaskRunner pid=16292) wandb: ⭐️ View project at https://wandb.ai/<user>/verl_grpo_example_gsm8k
(TaskRunner pid=16292) wandb: 🚀 View run at https://wandb.ai/<user>/verl_grpo_example_gsm8k/runs/n98y2dl8
(TaskRunner pid=16292) Using LocalLogger is deprecated. The constructor API will change 
(TaskRunner pid=16292) Checkpoint tracker file does not exist: %s /workspace/verl/checkpoints/verl_grpo_example_gsm8k/grpo_GSM8k_qwen0.5_test/latest_checkpointed_iteration.txt
(TaskRunner pid=16292) Training from scratch
(TaskRunner pid=16292) test_gen_batch meta info: {'eos_token_id': 151645, 'pad_token_id': 151643, 'recompute_log_prob': False, 'do_sample': False, 'validate': True}
(WorkerDict pid=17565) /usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/verl_engine.py:160: RuntimeWarning: coroutine 'TokenizerManager.flush_cache' was never awaited
(WorkerDict pid=17565)   self._engine.tokenizer_manager.flush_cache()
(WorkerDict pid=17565) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(WorkerDict pid=17884) /usr/local/lib/python3.12/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
(WorkerDict pid=17884)   warnings.warn(
(WorkerDict pid=17884) self.sampling_params={'n': 1, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
(WorkerDict pid=17884) kwargs: {'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
(WorkerDict pid=17884) /usr/local/lib/python3.12/dist-packages/sglang/srt/utils.py:888: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(WorkerDict pid=17884)   tensor_data = torch.ByteTensor(
(WorkerDict pid=17884) /usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/verl_engine.py:160: RuntimeWarning: coroutine 'TokenizerManager.flush_cache' was never awaited
(WorkerDict pid=17884)   self._engine.tokenizer_manager.flush_cache()
(WorkerDict pid=17884) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(TaskRunner pid=16292) validation generation end
(WorkerDict pid=17883) self.sampling_params={'n': 1, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 3x across cluster]
(TaskRunner pid=16292) [prompt] system
(TaskRunner pid=16292) You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
(TaskRunner pid=16292) user
(TaskRunner pid=16292) Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer after "####".
(TaskRunner pid=16292) assistant
(TaskRunner pid=16292) 
(TaskRunner pid=16292) [response] To determine how much Janet makes at the farmers' market every day, we need to follow these steps:
(TaskRunner pid=16292) 
(TaskRunner pid=16292) 1. **Calculate the total number of eggs laid by the ducks in a day:**
(TaskRunner pid=16292)    - Janet's ducks lay 16 eggs per day.
(TaskRunner pid=16292) 
(TaskRunner pid=16292) 2. **Calculate the total number of eggs Janet eats in a day:**
(TaskRunner pid=16292)    - Janet eats 3 eggs for breakfast.
(TaskRunner pid=16292)    - She eats 4 muffins for baking.
(TaskRunner pid=16292)    - Therefore, the total number of eggs she eats in a day is:
(TaskRunner pid=16292)      \[
(TaskRunner pid=16292)      3 \text{ (breakfast)} + 4 \text{ (baking)} = 7 \text{ eggs}
(TaskRunner pid=16292)      \]
(TaskRunner pid=16292) 
(TaskRunner pid=16292) 3. **Calculate the number of eggs Janet sells at the farmers' market in a day:**
(TaskRunner pid=16292)    - She sells the remainder of the eggs at the farmers' market.
(TaskRunner pid=16292)    - The total number of eggs laid in a day is 16.
(TaskRunner pid=16292)    - Subtract the number of eggs she eats from the total:
(TaskRunner pid=16292)      \[
(TaskRunner pid=16292)      16 \text{ (total eggs)} - 7 \text{ (eggs eaten)} = 9 \text{ eggs}
(TaskRunner pid=16292)      \]
(TaskRunner pid=16292) 
(TaskRunner pid=16292) 4. **Calculate the total revenue from selling the eggs at the farmers' market:**
(TaskRunner pid=16292)    - Each egg is sold for $2.
(TaskRunner pid=16292)    - The number of eggs sold is 9.
(TaskRunner pid=16292)    - Therefore, the total revenue is:
(TaskRunner pid=16292)      \[
(TaskRunner pid=16292)      9 \text{ eggs} \times 2 \text{ dollars/egg} = 18 \text{ dollars}
(TaskRunner pid=16292)      \]
(TaskRunner pid=16292) 
(TaskRunner pid=16292) Thus, Janet makes \(\boxed{18}\) dollars every day at the farmers' market.
(TaskRunner pid=16292) [ground_truth] 18
(TaskRunner pid=16292) [score] 0.0
Training Progress:   0%|          | 0/105 [00:00<?, ?it/s]
(TaskRunner pid=16292) ("Initial validation metrics: {'val/test_score/openai/gsm8k': "
(TaskRunner pid=16292)  '0.000758150113722517}')
(TaskRunner pid=16292) step:0 - val/test_score/openai/gsm8k:0.001
Training Progress:   1%|          | 1/105 [01:42<2:57:35, 102.46s/it]
(WorkerDict pid=17565) /usr/local/lib/python3.12/dist-packages/sglang/srt/utils.py:888: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(WorkerDict pid=17565)   tensor_data = torch.ByteTensor(
(TaskRunner pid=16292) step:1 - global_seqlen/min:540297.000 - global_seqlen/max:558462.000 - global_seqlen/minmax_diff:18165.000 - global_seqlen/balanced_min:552118.000 - global_seqlen/balanced_max:552119.000 - global_seqlen/mean:552118.250 - actor/kl_loss:0.001 - actor/kl_coef:0.001 - actor/entropy_loss:0.564 - actor/pg_loss:0.006 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:0.085 - perf/mfu/actor:0.948 - perf/max_memory_allocated_gb:25.921 - perf/max_memory_reserved_gb:61.611 - perf/cpu_memory_used_gb:333.392 - actor/lr:0.000 - critic/score/mean:0.009 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.009 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.001 - critic/advantages/max:1.789 - critic/advantages/min:-1.095 - critic/returns/mean:-0.001 - critic/returns/max:1.789 - critic/returns/min:-1.095 - response_length/mean:326.915 - response_length/max:1024.000 - response_length/min:3.000 - response_length/clip_ratio:0.007 - prompt_length/mean:104.428 - prompt_length/max:215.000 - prompt_length/min:65.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:74.396 - timing_s/old_log_prob:6.685 - timing_s/ref:3.780 - timing_s/adv:1.153 - timing_s/update_actor:15.487 - timing_s/step:101.599 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/ref:0.002 - timing_per_token_ms/update_actor:0.007 - timing_per_token_ms/gen:0.044 - perf/total_num_tokens:2208473.000 - perf/time_per_step:101.599 - perf/throughput:5434.299
(WorkerDict pid=17885) self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 4x across cluster]
(WorkerDict pid=17565) /usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/verl_engine.py:160: RuntimeWarning: coroutine 'TokenizerManager.flush_cache' was never awaited
(WorkerDict pid=17565)   self._engine.tokenizer_manager.flush_cache()
(WorkerDict pid=17565) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Training Progress:   2%|▏         | 2/105 [03:19<2:49:58, 99.02s/it] 
(WorkerDict pid=17884) /usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/verl_engine.py:160: RuntimeWarning: coroutine 'TokenizerManager.flush_cache' was never awaited
(WorkerDict pid=17884)   self._engine.tokenizer_manager.flush_cache()
(WorkerDict pid=17884) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(TaskRunner pid=16292) step:2 - global_seqlen/min:538646.000 - global_seqlen/max:556412.000 - global_seqlen/minmax_diff:17766.000 - global_seqlen/balanced_min:547190.000 - global_seqlen/balanced_max:547191.000 - global_seqlen/mean:547190.500 - actor/kl_loss:0.001 - actor/kl_coef:0.001 - actor/entropy_loss:0.554 - actor/pg_loss:0.001 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:0.069 - perf/mfu/actor:1.018 - perf/max_memory_allocated_gb:25.921 - perf/max_memory_reserved_gb:61.611 - perf/cpu_memory_used_gb:360.092 - actor/lr:0.000 - critic/score/mean:0.011 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.011 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.003 - critic/advantages/max:1.789 - critic/advantages/min:-0.730 - critic/returns/mean:-0.003 - critic/returns/max:1.789 - critic/returns/min:-0.730 - response_length/mean:324.958 - response_length/max:1024.000 - response_length/min:9.000 - response_length/clip_ratio:0.004 - prompt_length/mean:102.534 - prompt_length/max:256.000 - prompt_length/min:63.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:74.020 - timing_s/old_log_prob:3.566 - timing_s/ref:3.420 - timing_s/adv:1.167 - timing_s/update_actor:14.302 - timing_s/step:96.551 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/ref:0.002 - timing_per_token_ms/update_actor:0.007 - timing_per_token_ms/gen:0.044 - perf/total_num_tokens:2188762.000 - perf/time_per_step:96.551 - perf/throughput:5667.355
(WorkerDict pid=17565) self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 4x across cluster]

EduardDurech avatar Apr 22 '25 23:04 EduardDurech

@zhaochenyang20

EduardDurech avatar Apr 22 '25 23:04 EduardDurech

Cool, we'll keep track of this @ocss884

zhaochenyang20 avatar Apr 22 '25 23:04 zhaochenyang20

Update: flashinfer-python 0.2.3 has the same issue and 0.2.5 has OOM every time

Log flashinfer_python 0.2.5
(TaskRunner pid=279996) step:0 - val/test_score/openai/gsm8k:0.001
Training Progress:   1%|          | 1/105 [01:41<2:56:05, 101.59s/it]
(WorkerDict pid=281254) /usr/local/lib/python3.12/dist-packages/sglang/srt/utils.py:888: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
(WorkerDict pid=281254)   tensor_data = torch.ByteTensor(
(TaskRunner pid=279996) step:1 - global_seqlen/min:540298.000 - global_seqlen/max:562562.000 - global_seqlen/minmax_diff:22264.000 - global_seqlen/balanced_min:549130.000 - global_seqlen/balanced_max:549131.000 - global_seqlen/mean:549130.750 - actor/kl_loss:0.001 - actor/kl_coef:0.001 - actor/entropy_loss:0.532 - actor/pg_loss:-0.003 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/grad_norm:0.071 - perf/mfu/actor:0.965 - perf/max_memory_allocated_gb:27.409 - perf/max_memory_reserved_gb:46.957 - perf/cpu_memory_used_gb:320.904 - actor/lr:0.000 - critic/score/mean:0.010 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.010 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.001 - critic/advantages/max:1.789 - critic/advantages/min:-0.730 - critic/returns/mean:-0.001 - critic/returns/max:1.789 - critic/returns/min:-0.730 - response_length/mean:324.581 - response_length/max:1024.000 - response_length/min:4.000 - response_length/clip_ratio:0.005 - prompt_length/mean:104.428 - prompt_length/max:215.000 - prompt_length/min:65.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:75.329 - timing_s/old_log_prob:5.548 - timing_s/ref:3.553 - timing_s/adv:1.124 - timing_s/update_actor:15.137 - timing_s/step:100.772 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/update_actor:0.007 - timing_per_token_ms/gen:0.045 - timing_per_token_ms/ref:0.002 - perf/total_num_tokens:2196523.000 - perf/time_per_step:100.772 - perf/throughput:5449.227
(WorkerDict pid=281551) self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 4x across cluster]
(WorkerDict pid=281254) /usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/verl_engine.py:160: RuntimeWarning: coroutine 'TokenizerManager.flush_cache' was never awaited
(WorkerDict pid=281254)   self._engine.tokenizer_manager.flush_cache()
(WorkerDict pid=281254) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(WorkerDict pid=281552) [2025-04-23 19:23:11 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
(WorkerDict pid=281552)   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 112, in forward_thread_func
(WorkerDict pid=281552)     self.forward_thread_func_()
(WorkerDict pid=281552)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(WorkerDict pid=281552)     return func(*args, **kwargs)
(WorkerDict pid=281552)            ^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=281552)   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 143, in forward_thread_func_
(WorkerDict pid=281552)     logits_output, next_token_ids = self.worker.forward_batch_generation(
(WorkerDict pid=281552)                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=281552)   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker.py", line 184, in forward_batch_generation
(WorkerDict pid=281552)     next_token_ids = self.model_runner.sample(logits_output, model_worker_batch)
(WorkerDict pid=281552)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=281552)   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/model_runner.py", line 1093, in sample
(WorkerDict pid=281552)     next_token_ids = self.sampler(
(WorkerDict pid=281552)                      ^^^^^^^^^^^^^
(WorkerDict pid=281552)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(WorkerDict pid=281552)     return self._call_impl(*args, **kwargs)
(WorkerDict pid=281552)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=281552)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(WorkerDict pid=281552)     return forward_call(*args, **kwargs)
(WorkerDict pid=281552)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=281552)   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/sampler.py", line 92, in forward
(WorkerDict pid=281552)     top_p_normalize_probs_torch(probs, sampling_info.top_ps)
(WorkerDict pid=281552)   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/sampler.py", line 234, in top_p_normalize_probs_torch
(WorkerDict pid=281552)     probs_sort, probs_idx = probs.sort(dim=-1, descending=True)
(WorkerDict pid=281552)                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(WorkerDict pid=281552) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.84 GiB. GPU 1 has a total capacity of 94.50 GiB of which 4.33 GiB is free. Process 281553 has 13.01 GiB memory in use. Including non-PyTorch memory, this process has 75.55 GiB memory in use. Of the allocated memory 65.20 GiB is allocated by PyTorch, with 258.09 MiB allocated in private pools (e.g., CUDA Graphs), and 8.57 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(WorkerDict pid=281552) 
(WorkerDict pid=281552) [2025-04-23 19:23:11] Received sigquit from a child process. It usually means the child failed.
(WorkerDict pid=281552) /usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/verl_engine.py:160: RuntimeWarning: coroutine 'TokenizerManager.flush_cache' was never awaited
(WorkerDict pid=281552)   self._engine.tokenizer_manager.flush_cache()
(WorkerDict pid=281552) RuntimeWarning: Enable tracemalloc to get the object allocation traceback
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff2650321679bef3307a983d1901000000 Worker ID: dd8365e52a1d680ffaf653a2f3421841cf5a8b7786578943a197d6b4 Node ID: 5103e02edceaed1a4be35f75171744e515a2500e8f43410cc01d5026 Worker IP address: 172.28.30.164 Worker port: 36799 Worker PID: 281552 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(WorkerDict pid=281551) self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 4x across cluster]
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/workspace/verl/data/train.parquet', 'data.val_files=/workspace/verl/data/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=1024', 'data.max_response_length=1024', 'data.filter_overlong_prompts=True', 'data.truncation=error', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=sglang', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.rollout.enforce_eager=False', 'actor_rollout_ref.rollout.free_cache_engine=False', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console,wandb]', 'trainer.project_name=verl_grpo_example_gsm8k', 'trainer.experiment_name=grpo_GSM8k_qwen0.5_test', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=5', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "/workspace/verl/verl/trainer/main_ppo.py", line 54, in main
    run_ppo(config)
  File "/workspace/verl/verl/trainer/main_ppo.py", line 72, in run_ppo
    ray.get(runner.run.remote(config))
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2782, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 929, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::TaskRunner.run() (pid=279996, ip=172.28.30.164, actor_id=f1859511a011e833790520db01000000, repr=<main_ppo.TaskRunner object at 0x40000ba1b380>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/verl/verl/trainer/main_ppo.py", line 171, in run
    trainer.fit()
  File "/workspace/verl/verl/trainer/ppo/ray_trainer.py", line 817, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/verl/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
             ^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
        class_name: create_colocated_worker_cls.<locals>.WorkerDict
        actor_id: 2650321679bef3307a983d1901000000
        pid: 281552
        name: PfngwQWorkerDict_0:2
        namespace: 79b5054e-169f-499a-8239-36727efd50da
        ip: 172.28.30.164
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

EduardDurech avatar Apr 23 '25 17:04 EduardDurech

cc @yzh119 for visibility

eric-haibin-lin avatar May 01 '25 20:05 eric-haibin-lin

Update: flashinfer-python 0.2.3 has the same issue and 0.2.5 has OOM every time Log flashinfer_python 0.2.5

Update: flashinfer-python 0.2.5 works with newest commit and sgl{ang,-kernel} updates

flashinfer-python                 0.2.5
sgl-kernel                        0.1.0
sglang                            0.4.6.post1

Speed the same

EduardDurech avatar May 05 '25 09:05 EduardDurech

Update: flashinfer-python 0.2.3 has the same issue and 0.2.5 has OOM every time Log flashinfer_python 0.2.5

Update: flashinfer-python 0.2.5 works with newest commit and sgl{ang,-kernel} updates

flashinfer-python                 0.2.5
sgl-kernel                        0.1.0
sglang                            0.4.6.post1

Speed the same

Interesting, have you tested SGLang vs vLLM alone, without verl, on your machine?

ocss884 avatar May 05 '25 14:05 ocss884

same question,

I tested both vLLM and SGLang. When using tensor_model_parallel_size=2, SGLang is much slower than vLLM. However, when tensor_model_parallel_size=1, their speeds are nearly the same.

Yzx835 avatar May 16 '25 08:05 Yzx835

same question,

I tested both vLLM and SGLang. When using tensor_model_parallel_size=2, SGLang is much slower than vLLM. However, when tensor_model_parallel_size=1, their speeds are nearly the same.

@hebiao064 Could you help to test on H200?

zhaochenyang20 avatar May 16 '25 19:05 zhaochenyang20

Interesting, have you tested SGLang vs vLLM alone, without verl, on your machine?

@ocss884 Sorry for the late reply; there doesn't seem to be any standard local inference engine benchmark, so I had to put together my own in whatever free time I had.

Here are the results for offline and server mode, if all is correct:

engine mean_req_lat_ms mean_ttft_ms mean_tok_s global_tok_s
sglang 361.91 716.41
sglang_srv 66.72 591.00 2402.15
vllm 777.19 322.59
vllm_srv 41.60 226.74 1216.74
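The harness is home-grown; a rough sketch of the kind of offline comparison (an assumed setup, not the exact script used; the model, prompt set, and lengths are illustrative, and each engine should be run in a separate process so they don't contend for GPU memory):

# bench.py -- offline throughput sketch; run e.g. `python bench.py sglang 1`
import sys
import time

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"
PROMPTS = ["Janet's ducks lay 16 eggs per day. She eats three. How many are left?"] * 256

def bench_sglang(tp: int) -> float:
    import sglang as sgl
    llm = sgl.Engine(model_path=MODEL, tp_size=tp)
    t0 = time.time()
    llm.generate(PROMPTS, {"temperature": 1.0, "max_new_tokens": 1024})
    elapsed = time.time() - t0
    llm.shutdown()
    return elapsed

def bench_vllm(tp: int) -> float:
    from vllm import LLM, SamplingParams
    llm = LLM(model=MODEL, tensor_parallel_size=tp)
    t0 = time.time()
    llm.generate(PROMPTS, SamplingParams(temperature=1.0, max_tokens=1024))
    return time.time() - t0

if __name__ == "__main__":
    engine, tp = sys.argv[1], int(sys.argv[2])
    fn = bench_sglang if engine == "sglang" else bench_vllm
    print(f"{engine} tp={tp}: {fn(tp):.2f}s for {len(PROMPTS)} prompts")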

Also, I was waiting for https://github.com/vllm-project/vllm/pull/15777 so that the dependencies are the same.

flash_attn                               2.7.3
flash_attn_3                             3.0.0b1
flashinfer-python                        0.2.5          #Built from later commit
sgl-kernel                               0.1.2.post1
sglang                                   0.4.6.post4
transformers                             4.51.1
vllm                                     0.8.5.post1    #Built from later commit

EduardDurech avatar May 18 '25 01:05 EduardDurech

Here is the same plot with verl main up to https://github.com/volcengine/verl/commit/3a7376acfef33af8b762526aecf3016c7ddcd997

Image

Image

More graphs

Image

Image

Image

Image

Some things improved with the newer veRL/veRL-SGLang (timing_{s, per_token_ms}/{adv, ref}). We are getting the Ray dashboard with Nsight set up on our cluster so we can profile; also, as I've offered on Slack, we can give you access to a node if you want.
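For context, Ray's built-in Nsight integration attaches via runtime_env; a toy sketch of how it hooks in (an assumption about the setup being referred to; it needs nsys installed on the node and a recent Ray version, and the actor below is just a stand-in, not a verl worker):

# Toy sketch of Ray's Nsight Systems hook via runtime_env.
import ray
import torch

ray.init()

@ray.remote(num_gpus=1, runtime_env={"nsight": "default"})
class ProfiledWorker:
    def step(self) -> float:
        x = torch.randn(4096, 4096, device="cuda")
        return (x @ x).sum().item()

w = ProfiledWorker.remote()
print(ray.get(w.step.remote()))
# The resulting .nsys-rep report lands under the Ray session's log directory.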

EduardDurech avatar May 18 '25 01:05 EduardDurech

I tested both vLLM and SGLang. When using tensor_model_parallel_size=2, SGLang is much slower than vLLM. However, when tensor_model_parallel_size=1, their speeds are nearly the same.

vLLM vs. SGLang without veRL

TP=1

engine mean_req_lat_ms mean_ttft_ms mean_tok_s global_tok_s
sglang 361.91 716.41
sglang_srv 66.72 591.00 2402.15
vllm 777.19 322.59
vllm_srv 41.60 226.74 1216.74

TP=2

engine mean_req_lat_ms mean_ttft_ms mean_tok_s global_tok_s
sglang 380.34 682.71
sglang_srv 173.23 543.66 1706.27
vllm 1307.92 190.43
vllm_srv 45.02 145.91 800.06

Standalone SGLang is still faster than vLLM with TP=2; will test veRL next.
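
For clarity, the standalone TP=2 runs were along these lines (a hedged sketch, not the exact commands; tp_size / tensor_parallel_size are the engines' own TP knobs, the model and prompts are placeholders, and in practice each engine ran in its own job):

import sglang as sgl
from vllm import LLM, SamplingParams

prompts = ["Solve 12 * 7 and explain your steps."] * 64  # placeholder batch

# SGLang offline engine, tensor parallel across 2 GPUs
sgl_engine = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B-Instruct", tp_size=2)
sgl_out = sgl_engine.generate(prompts, {"temperature": 1.0, "max_new_tokens": 1024})
sgl_engine.shutdown()

# vLLM offline engine, tensor parallel across 2 GPUs
vllm_engine = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", tensor_parallel_size=2)
vllm_out = vllm_engine.generate(prompts, SamplingParams(temperature=1.0, max_tokens=1024))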

EduardDurech avatar May 18 '25 03:05 EduardDurech

We are indeed testing SGLang with verl using our own utilities. Please stay tuned!

zhaochenyang20 avatar May 18 '25 04:05 zhaochenyang20

I tested both vLLM and SGLang. When using tensor_model_parallel_size=2, SGLang is much slower than vLLM. However, when tensor_model_parallel_size=1, their speeds are nearly the same.

TP seems to be at least part of the problem: with TP=1, SGLang and vLLM are similar; with TP=2, both slow down significantly, SGLang even more.

Very different behaviour from standalone SGLang and vLLM. Image Image

More graphs

Image Image

EduardDurech avatar May 18 '25 10:05 EduardDurech

TP=2 also frequently hits OOM with veRL-SGLang; I don't get the same issue with TP=1, nor with vLLM at TP={1,2}.

Not sure if this is an NCCL issue (the cluster maintainers have mentioned an NCCL problem that NVIDIA is looking at), but it only happens with veRL-SGLang at TP=2.
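
One mitigation the traceback below itself suggests, for the fragmentation part at least, is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch, assuming it is in the environment of every worker before torch initializes CUDA (i.e. set in the launch environment rather than mid-run):

import os
# Must be set before the first CUDA allocation in each process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
import torch  # imported only after the env var is set

This does not explain the TP=2-only OOM; it just reduces fragmentation pressure.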

veRL-SGLang TP=2 log
++ pwd
++ pwd
+ python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo data.train_files=/workspace/verl/data/train.parquet data.val_files=/workspace/verl/data/test.parquet data.train_batch_size=1024 data.max_prompt_length=1024 data.max_response_length=1024 data.filter_overlong_prompts=True data.truncation=error actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct actor_rollout_ref.actor.optim.lr=1e-6 actor_rollout_ref.model.use_remove_padding=True actor_rollout_ref.actor.ppo_mini_batch_size=256 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40 actor_rollout_ref.actor.use_kl_loss=True actor_rollout_ref.actor.kl_loss_coef=0.001 actor_rollout_ref.actor.kl_loss_type=low_var_kl actor_rollout_ref.model.enable_gradient_checkpointing=True actor_rollout_ref.actor.fsdp_config.param_offload=False actor_rollout_ref.actor.fsdp_config.optimizer_offload=False actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40 actor_rollout_ref.rollout.tensor_model_parallel_size=2 actor_rollout_ref.rollout.name=sglang actor_rollout_ref.rollout.gpu_memory_utilization=0.6 actor_rollout_ref.rollout.n=5 actor_rollout_ref.rollout.enforce_eager=False actor_rollout_ref.rollout.free_cache_engine=False actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40 actor_rollout_ref.ref.fsdp_config.param_offload=True actor_rollout_ref.ref.strategy=fsdp2 actor_rollout_ref.actor.strategy=fsdp2 critic.strategy=fsdp2 reward_model.strategy=fsdp2 algorithm.kl_ctrl.kl_coef=0.001 trainer.critic_warmup=0 'trainer.logger=[console,wandb]' trainer.project_name=verl_grpo_example_gsm8k trainer.experiment_name=sgl_2_grpo_GSM8k_qwen0.5_test trainer.n_gpus_per_node=4 trainer.nnodes=1 trainer.save_freq=-1 trainer.test_freq=5 trainer.total_epochs=15
2025-05-18 14:43:54,387 INFO worker.py:1888 -- Started a local Ray instance.
^[[36m(TaskRunner pid=37599)^[[0m Filtering prompts longer than 1024 tokens: 100%|██████████| 7473/7473 [00:01<00:00, 4784.23 examples/s]
^[[36m(TaskRunner pid=37599)^[[0m Filtering prompts longer than 1024 tokens: 100%|██████████| 1319/1319 [00:00<00:00, 4714.26 examples/s]
^[[36m(TaskRunner pid=37599)^[[0m DeprecationWarning: `ray.state.available_resources_per_node` is a private attribute and access will be removed in a future Ray version.
^[[36m(TaskRunner pid=37599)^[[0m WARNING:2025-05-18 14:44:19,154:Waiting for register center actor ItrCVf_register_center to be ready. Elapsed time: 0 seconds out of 300 seconds.
^[[36m(WorkerDict pid=39180)^[[0m You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
^[[36m(WorkerDict pid=39181)^[[0m [rank3]:[W518 14:44:35.996306610 ProcessGroupNCCL.cpp:4571] [PG ID 0 PG GUID 0 Rank 3]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
^[[36m(WorkerDict pid=39180)^[[0m Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
^[[36m(WorkerDict pid=38869)^[[0m You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.^[[32m [repeated 3x across cluster]^[[0m
^[[36m(WorkerDict pid=39179)^[[0m [rank1]:[W518 14:44:36.483249218 ProcessGroupNCCL.cpp:4571] [PG ID 0 PG GUID 0 Rank 1]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.^[[32m [repeated 3x across cluster]^[[0m
^[[36m(WorkerDict pid=38869)^[[0m Capturing batches (avail_mem=35.12 GB):   0%|          | 0/23 [00:00<?, ?it/s]
^[[36m(WorkerDict pid=38869)^[[0m Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`^[[32m [repeated 3x across cluster]^[[0m
^[[36m(WorkerDict pid=38869)^[[0m Capturing batches (avail_mem=33.64 GB): 100%|██████████| 23/23 [00:10<00:00, 2.15it/s]
^[[36m(WorkerDict pid=39180)^[[0m Capturing batches (avail_mem=34.54 GB):   0%|          | 0/23 [00:00<?, ?it/s]
^[[36m(WorkerDict pid=39180)^[[0m Capturing batches (avail_mem=33.12 GB): 100%|██████████| 23/23 [00:12<00:00, 1.89it/s]
^[[36m(TaskRunner pid=37599)^[[0m wandb: Currently logged in as: to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
^[[36m(TaskRunner pid=37599)^[[0m wandb: Tracking run with wandb version 0.19.11
^[[36m(TaskRunner pid=37599)^[[0m wandb: Run data is saved locally in /workspace/verl/wandb/run-20250518_144532-e1f3wtvu
^[[36m(TaskRunner pid=37599)^[[0m wandb: Run `wandb offline` to turn off syncing.
^[[36m(TaskRunner pid=37599)^[[0m wandb: Syncing run sgl_2_grpo_GSM8k_qwen0.5_test
^[[36m(TaskRunner pid=37599)^[[0m wandb: ⭐️ View project at https://wandb.ai/verl_grpo_example_gsm8k
^[[36m(TaskRunner pid=37599)^[[0m wandb: 🚀 View run at https://wandb.ai/verl_grpo_example_gsm8k/runs/e1f3wtvu
^[[36m(WorkerDict pid=38869)^[[0m /usr/local/lib/python3.12/dist-packages/sglang/srt/utils.py:932: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
^[[36m(WorkerDict pid=38869)^[[0m   tensor_data = torch.ByteTensor(
^[[36m(TaskRunner pid=37599)^[[0m ^MTraining Progress:   0%|          | 0/105 [00:00<?, ?it/s]
^[[36m(TaskRunner pid=37599)^[[0m ^MTraining Progress:   1%|          | 1/105 [01:40<2:54:26, 100.64s/it]
^[[36m(WorkerDict pid=39180)^[[0m /usr/local/lib/python3.12/dist-packages/sglang/srt/utils.py:932: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
^[[36m(WorkerDict pid=39180)^[[0m   tensor_data = torch.ByteTensor(
^[[36m(TaskRunner pid=37599)^[[0m ^MTraining Progress:   2%|▏         | 2/105 [03:15<2:47:09, 97.38s/it]
^[[36m(TaskRunner pid=37599)^[[0m ^MTraining Progress:   3%|▎         | 3/105 [04:49<2:42:50, 95.79s/it]
^[[36m(WorkerDict pid=38869)^[[0m [2025-05-18 14:51:06 TP0] TpModelWorkerClient hit an exception: Traceback (most recent call last):
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 118, in forward_thread_func
^[[36m(WorkerDict pid=38869)^[[0m     self.forward_thread_func_()
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
^[[36m(WorkerDict pid=38869)^[[0m     return func(*args, **kwargs)
^[[36m(WorkerDict pid=38869)^[[0m            ^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 151, in forward_thread_func_
^[[36m(WorkerDict pid=38869)^[[0m     self.worker.forward_batch_generation(
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker.py", line 211, in forward_batch_generation
^[[36m(WorkerDict pid=38869)^[[0m     next_token_ids = self.model_runner.sample(
^[[36m(WorkerDict pid=38869)^[[0m                      ^^^^^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/model_runner.py", line 1161, in sample
^[[36m(WorkerDict pid=38869)^[[0m     next_token_ids = self.sampler(
^[[36m(WorkerDict pid=38869)^[[0m                      ^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
^[[36m(WorkerDict pid=38869)^[[0m     return self._call_impl(*args, **kwargs)
^[[36m(WorkerDict pid=38869)^[[0m            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
^[[36m(WorkerDict pid=38869)^[[0m     return forward_call(*args, **kwargs)
^[[36m(WorkerDict pid=38869)^[[0m            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/sampler.py", line 92, in forward
^[[36m(WorkerDict pid=38869)^[[0m     top_p_normalize_probs_torch(probs, sampling_info.top_ps)
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/sampler.py", line 234, in top_p_normalize_probs_torch
^[[36m(WorkerDict pid=38869)^[[0m     probs_sort, probs_idx = probs.sort(dim=-1, descending=True)
^[[36m(WorkerDict pid=38869)^[[0m                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.35 GiB. GPU 0 has a total capacity of 94.50 GiB of which 5.90 GiB is free. Process 38869 has 13.94 GiB memory in use. Including non-PyTorch memory, this process has 74.49 GiB memory in use. Of the allocated memory 65.72 GiB is allocated by PyTorch, with 222.59 MiB allocated in private pools (e.g., CUDA Graphs), and 7.22 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
^[[36m(WorkerDict pid=38869)^[[0m
^[[36m(WorkerDict pid=38869)^[[0m [2025-05-18 14:51:06] Received sigquit from a child process. It usually means the child failed.
^[[36m(WorkerDict pid=38869)^[[0m [2025-05-18 14:51:06 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 118, in forward_thread_func
^[[36m(WorkerDict pid=38869)^[[0m     self.forward_thread_func_()
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
^[[36m(WorkerDict pid=38869)^[[0m     return func(*args, **kwargs)
^[[36m(WorkerDict pid=38869)^[[0m            ^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 151, in forward_thread_func_
^[[36m(WorkerDict pid=38869)^[[0m     self.worker.forward_batch_generation(
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tp_worker.py", line 211, in forward_batch_generation
^[[36m(WorkerDict pid=38869)^[[0m     next_token_ids = self.model_runner.sample(
^[[36m(WorkerDict pid=38869)^[[0m                      ^^^^^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/model_executor/model_runner.py", line 1161, in sample
^[[36m(WorkerDict pid=38869)^[[0m     next_token_ids = self.sampler(
^[[36m(WorkerDict pid=38869)^[[0m                      ^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
^[[36m(WorkerDict pid=38869)^[[0m     return self._call_impl(*args, **kwargs)
^[[36m(WorkerDict pid=38869)^[[0m            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
^[[36m(WorkerDict pid=38869)^[[0m     return forward_call(*args, **kwargs)
^[[36m(WorkerDict pid=38869)^[[0m            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/sampler.py", line 92, in forward
^[[36m(WorkerDict pid=38869)^[[0m     top_p_normalize_probs_torch(probs, sampling_info.top_ps)
^[[36m(WorkerDict pid=38869)^[[0m   File "/usr/local/lib/python3.12/dist-packages/sglang/srt/layers/sampler.py", line 234, in top_p_normalize_probs_torch
^[[36m(WorkerDict pid=38869)^[[0m     probs_sort, probs_idx = probs.sort(dim=-1, descending=True)
^[[36m(WorkerDict pid=38869)^[[0m                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^[[36m(WorkerDict pid=38869)^[[0m torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.35 GiB. GPU 1 has a total capacity of 94.50 GiB of which 5.97 GiB is free. Process 39179 has 13.30 GiB memory in use. Including non-PyTorch memory, this process has 74.35 GiB memory in use. Of the allocated memory 65.58 GiB is allocated by PyTorch, with 222.59 MiB allocated in private pools (e.g., CUDA Graphs), and 7.22 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
^[[36m(WorkerDict pid=38869)^[[0m
^[[36m(WorkerDict pid=38869)^[[0m [2025-05-18 14:51:06] Received sigquit from a child process. It usually means the child failed.
^[[36m(WorkerDict pid=39180)^[[0m [rank2]:[W518 14:51:06.739446079 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=355, addr=[nid006560]:57628, remote=[nid006560]:42823): failed to recv, got 0 bytes
^[[36m(WorkerDict pid=39180)^[[0m Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:671 (most recent call first):
^[[36m(WorkerDict pid=39180)^[[0m frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x402fb19aa9a4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #1: <unknown function> + 0x58c1f40 (0x402f67271f40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #2: <unknown function> + 0x58c50a4 (0x402f672750a4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #3: <unknown function> + 0x58c6084 (0x402f67276084 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x1cc (0x402f67276e4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x280 (0x402f6a26f670 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #6: <unknown function> + 0xe1ae0 (0x400005141ae0 in /usr/lib/aarch64-linux-gnu/libstdc++.so.6)
^[[36m(WorkerDict pid=39180)^[[0m frame #7: <unknown function> + 0x8595c (0x400002d2595c in /usr/lib/aarch64-linux-gnu/libc.so.6)
^[[36m(WorkerDict pid=39180)^[[0m frame #8: <unknown function> + 0xeba4c (0x400002d8ba4c in /usr/lib/aarch64-linux-gnu/libc.so.6)
^[[36m(WorkerDict pid=39180)^[[0m
^[[36m(WorkerDict pid=39180)^[[0m [rank2]:[W518 14:51:06.742004368 ProcessGroupNCCL.cpp:1671] [PG ID 0 PG GUID 0(default_pg) Rank 2] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes
^[[36m(WorkerDict pid=39180)^[[0m Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:671 (most recent call first):
^[[36m(WorkerDict pid=39180)^[[0m frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x402fb19aa9a4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #1: <unknown function> + 0x58c1f40 (0x402f67271f40 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #2: <unknown function> + 0x58c50a4 (0x402f672750a4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #3: <unknown function> + 0x58c6084 (0x402f67276084 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x1cc (0x402f67276e4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x280 (0x402f6a26f670 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
^[[36m(WorkerDict pid=39180)^[[0m frame #6: <unknown function> + 0xe1ae0 (0x400005141ae0 in /usr/lib/aarch64-linux-gnu/libstdc++.so.6)
^[[36m(WorkerDict pid=39180)^[[0m frame #7: <unknown function> + 0x8595c (0x400002d2595c in /usr/lib/aarch64-linux-gnu/libc.so.6)
^[[36m(WorkerDict pid=39180)^[[0m frame #8: <unknown function> + 0xeba4c (0x400002d8ba4c in /usr/lib/aarch64-linux-gnu/libc.so.6)
^[[36m(WorkerDict pid=39180)^[[0m
^[[36m(WorkerDict pid=39179)^[[0m
^[[36m(WorkerDict pid=39179)^[[0m
^[[36m(TaskRunner pid=37599)^[[0m {'actor_rollout_ref': {'actor': {'checkpoint': {'contents': ['model',
^[[36m(TaskRunner pid=37599)^[[0m                                                              'optimizer',
^[[36m(TaskRunner pid=37599)^[[0m                                                              'extra']},
^[[36m(TaskRunner pid=37599)^[[0m                                  'clip_ratio': 0.2,
^[[36m(TaskRunner pid=37599)^[[0m                                  'clip_ratio_c': 3.0,
^[[36m(TaskRunner pid=37599)^[[0m                                  'clip_ratio_high': 0.2,
^[[36m(TaskRunner pid=37599)^[[0m                                  'clip_ratio_low': 0.2,
^[[36m(TaskRunner pid=37599)^[[0m                                  'entropy_coeff': 0,
^[[36m(TaskRunner pid=37599)^[[0m                                  'fsdp_config': {'fsdp_size': -1,
^[[36m(TaskRunner pid=37599)^[[0m                                                  'offload_policy': False,
^[[36m(TaskRunner pid=37599)^[[0m                                                  'optimizer_offload': False,
^[[36m(TaskRunner pid=37599)^[[0m                                                  'param_offload': False,
^[[36m(TaskRunner pid=37599)^[[0m                                                  'reshard_after_forward': True,
^[[36m(TaskRunner pid=37599)^[[0m                                                  'wrap_policy': {'min_num_params': 0}},
^[[36m(TaskRunner pid=37599)^[[0m                                  'grad_clip': 1.0,
^[[36m(TaskRunner pid=37599)^[[0m                                  'kl_loss_coef': 0.001,
^[[36m(TaskRunner pid=37599)^[[0m                                  'kl_loss_type': 'low_var_kl',
^[[36m(TaskRunner pid=37599)^[[0m                                  'loss_agg_mode': 'token-mean',
^[[36m(TaskRunner pid=37599)^[[0m                                  'optim': {'lr': 1e-06,
^[[36m(TaskRunner pid=37599)^[[0m                                            'lr_warmup_steps': -1,
^[[36m(TaskRunner pid=37599)^[[0m                                            'lr_warmup_steps_ratio': 0.0,
^[[36m(TaskRunner pid=37599)^[[0m                                            'min_lr_ratio': None,
^[[36m(TaskRunner pid=37599)^[[0m                                            'total_training_steps': -1,
^[[36m(TaskRunner pid=37599)^[[0m                                            'warmup_style': 'constant',
^[[36m(TaskRunner pid=37599)^[[0m                                            'weight_decay': 0.01},
^[[36m(TaskRunner pid=37599)^[[0m                                  'ppo_epochs': 1,
^[[36m(TaskRunner pid=37599)^[[0m                                  'ppo_max_token_len_per_gpu': 16384,
^[[36m(TaskRunner pid=37599)^[[0m                                  'ppo_micro_batch_size': None,
^[[36m(TaskRunner pid=37599)^[[0m                                  'ppo_micro_batch_size_per_gpu': 40,
^[[36m(TaskRunner pid=37599)^[[0m                                  'ppo_mini_batch_size': 256,
^[[36m(TaskRunner pid=37599)^[[0m                                  'shuffle': False,
^[[36m(TaskRunner pid=37599)^[[0m                                  'strategy': 'fsdp2',
^[[36m(TaskRunner pid=37599)^[[0m                                  'ulysses_sequence_parallel_size': 1,
^[[36m(TaskRunner pid=37599)^[[0m                                  'use_dynamic_bsz': False,
^[[36m(TaskRunner pid=37599)^[[0m                                  'use_kl_loss': True,
^[[36m(TaskRunner pid=37599)^[[0m                                  'use_torch_compile': True},
^[[36m(TaskRunner pid=37599)^[[0m                        'hybrid_engine': True,
^[[36m(TaskRunner pid=37599)^[[0m                        'model': {'enable_gradient_checkpointing': True,
^[[36m(TaskRunner pid=37599)^[[0m                                  'external_lib': None,
^[[36m(TaskRunner pid=37599)^[[0m                                  'override_config': {},
^[[36m(TaskRunner pid=37599)^[[0m                                  'path': 'Qwen/Qwen2.5-0.5B-Instruct',
^[[36m(TaskRunner pid=37599)^[[0m                                  'use_liger': False,
^[[36m(TaskRunner pid=37599)^[[0m                                  'use_remove_padding': True},
^[[36m(TaskRunner pid=37599)^[[0m                        'ref': {'fsdp_config': {'param_offload': True,
^[[36m(TaskRunner pid=37599)^[[0m                                                'reshard_after_forward': True,
^[[36m(TaskRunner pid=37599)^[[0m                                                'wrap_policy': {'min_num_params': 0}},
^[[36m(TaskRunner pid=37599)^[[0m                                'log_prob_max_token_len_per_gpu': 16384,
^[[36m(TaskRunner pid=37599)^[[0m                                'log_prob_micro_batch_size': None,
^[[36m(TaskRunner pid=37599)^[[0m                                'log_prob_micro_batch_size_per_gpu': 40,
^[[36m(TaskRunner pid=37599)^[[0m                                'log_prob_use_dynamic_bsz': False,
^[[36m(TaskRunner pid=37599)^[[0m                                'strategy': 'fsdp2',
^[[36m(TaskRunner pid=37599)^[[0m                                'ulysses_sequence_parallel_size': 1,
^[[36m(TaskRunner pid=37599)^[[0m                                'use_torch_compile': True},
^[[36m(TaskRunner pid=37599)^[[0m                        'rollout': {'chat_scheduler': None,
^[[36m(TaskRunner pid=37599)^[[0m                                    'disable_log_stats': True,
^[[36m(TaskRunner pid=37599)^[[0m                                    'do_sample': True,
^[[36m(TaskRunner pid=37599)^[[0m                                    'dtype': 'bfloat16',
^[[36m(TaskRunner pid=37599)^[[0m                                    'enable_chunked_prefill': True,
^[[36m(TaskRunner pid=37599)^[[0m                                    'enforce_eager': False,
^[[36m(TaskRunner pid=37599)^[[0m                                    'engine_kwargs': {'swap_space': None},
^[[36m(TaskRunner pid=37599)^[[0m                                    'free_cache_engine': False,
^[[36m(TaskRunner pid=37599)^[[0m                                    'gpu_memory_utilization': 0.6,
^[[36m(TaskRunner pid=37599)^[[0m                                    'ignore_eos': False,
^[[36m(TaskRunner pid=37599)^[[0m                                    'load_format': 'dummy_dtensor',
^[[36m(TaskRunner pid=37599)^[[0m                                    'log_prob_max_token_len_per_gpu': 16384,
^[[36m(TaskRunner pid=37599)^[[0m                                    'log_prob_micro_batch_size': None,
^[[36m(TaskRunner pid=37599)^[[0m                                    'log_prob_micro_batch_size_per_gpu': 40,
^[[36m(TaskRunner pid=37599)^[[0m                                    'log_prob_use_dynamic_bsz': False,
^[[36m(TaskRunner pid=37599)^[[0m                                    'max_model_len': None,
^[[36m(TaskRunner pid=37599)^[[0m                                    'max_num_batched_tokens': 8192,
^[[36m(TaskRunner pid=37599)^[[0m                                    'max_num_seqs': 1024,
^[[36m(TaskRunner pid=37599)^[[0m                                    'mode': 'sync',
^[[36m(TaskRunner pid=37599)^[[0m                                    'multi_turn': {'enable': False,
^[[36m(TaskRunner pid=37599)^[[0m                                                   'format': 'chatml',
^[[36m(TaskRunner pid=37599)^[[0m                                                   'max_turns': None,
^[[36m(TaskRunner pid=37599)^[[0m                                                   'tool_config_path': None},
^[[36m(TaskRunner pid=37599)^[[0m                                    'n': 5,
^[[36m(TaskRunner pid=37599)^[[0m                                    'name': 'sglang',
^[[36m(TaskRunner pid=37599)^[[0m                                    'prompt_length': 1024,
^[[36m(TaskRunner pid=37599)^[[0m                                    'response_length': 1024,
^[[36m(TaskRunner pid=37599)^[[0m                                    'temperature': 1.0,
^[[36m(TaskRunner pid=37599)^[[0m                                    'tensor_model_parallel_size': 2,
^[[36m(TaskRunner pid=37599)^[[0m                                    'top_k': -1,
^[[36m(TaskRunner pid=37599)^[[0m                                    'top_p': 1,
^[[36m(TaskRunner pid=37599)^[[0m                                    'use_fire_sampling': False,
^[[36m(TaskRunner pid=37599)^[[0m                                    'val_kwargs': {'do_sample': False,
^[[36m(TaskRunner pid=37599)^[[0m                                                   'n': 1,
^[[36m(TaskRunner pid=37599)^[[0m                                                   'temperature': 0,
^[[36m(TaskRunner pid=37599)^[[0m                                                   'top_k': -1,
^[[36m(TaskRunner pid=37599)^[[0m                                                   'top_p': 1.0}}},
^[[36m(TaskRunner pid=37599)^[[0m  'algorithm': {'adv_estimator': 'grpo',
^[[36m(TaskRunner pid=37599)^[[0m                'gamma': 1.0,
^[[36m(TaskRunner pid=37599)^[[0m                'kl_ctrl': {'horizon': 10000,
^[[36m(TaskRunner pid=37599)^[[0m                            'kl_coef': 0.001,
^[[36m(TaskRunner pid=37599)^[[0m                            'target_kl': 0.1,
^[[36m(TaskRunner pid=37599)^[[0m                            'type': 'fixed'},
^[[36m(TaskRunner pid=37599)^[[0m                'kl_penalty': 'kl',
^[[36m(TaskRunner pid=37599)^[[0m                'lam': 1.0,
^[[36m(TaskRunner pid=37599)^[[0m                'norm_adv_by_std_in_grpo': True,
^[[36m(TaskRunner pid=37599)^[[0m                'use_kl_in_reward': False},
^[[36m(TaskRunner pid=37599)^[[0m  'critic': {'checkpoint': {'contents': ['model', 'optimizer', 'extra']},
^[[36m(TaskRunner pid=37599)^[[0m             'cliprange_value': 0.5,
^[[36m(TaskRunner pid=37599)^[[0m             'forward_max_token_len_per_gpu': 32768,
^[[36m(TaskRunner pid=37599)^[[0m             'forward_micro_batch_size': None,
^[[36m(TaskRunner pid=37599)^[[0m             'forward_micro_batch_size_per_gpu': None,
^[[36m(TaskRunner pid=37599)^[[0m             'grad_clip': 1.0,
^[[36m(TaskRunner pid=37599)^[[0m             'model': {'enable_gradient_checkpointing': True,
^[[36m(TaskRunner pid=37599)^[[0m                       'external_lib': None,
^[[36m(TaskRunner pid=37599)^[[0m                       'fsdp_config': {'fsdp_size': -1,
^[[36m(TaskRunner pid=37599)^[[0m                                       'offload_policy': False,
^[[36m(TaskRunner pid=37599)^[[0m                                       'optimizer_offload': False,
^[[36m(TaskRunner pid=37599)^[[0m                                       'param_offload': False,
^[[36m(TaskRunner pid=37599)^[[0m                                       'reshard_after_forward': True,
^[[36m(TaskRunner pid=37599)^[[0m                                       'wrap_policy': {'min_num_params': 0}},
^[[36m(TaskRunner pid=37599)^[[0m                       'override_config': {},
^[[36m(TaskRunner pid=37599)^[[0m                       'path': '~/models/deepseek-llm-7b-chat',
^[[36m(TaskRunner pid=37599)^[[0m                       'tokenizer_path': 'Qwen/Qwen2.5-0.5B-Instruct',
^[[36m(TaskRunner pid=37599)^[[0m                       'use_remove_padding': False},
^[[36m(TaskRunner pid=37599)^[[0m             'optim': {'lr': 1e-05,
^[[36m(TaskRunner pid=37599)^[[0m                       'lr_warmup_steps_ratio': 0.0,
^[[36m(TaskRunner pid=37599)^[[0m                       'min_lr_ratio': None,
^[[36m(TaskRunner pid=37599)^[[0m                       'total_training_steps': -1,
^[[36m(TaskRunner pid=37599)^[[0m                       'warmup_style': 'constant',
^[[36m(TaskRunner pid=37599)^[[0m                       'weight_decay': 0.01},
^[[36m(TaskRunner pid=37599)^[[0m             'ppo_epochs': 1,
^[[36m(TaskRunner pid=37599)^[[0m             'ppo_max_token_len_per_gpu': 32768,
^[[36m(TaskRunner pid=37599)^[[0m             'ppo_micro_batch_size': None,
^[[36m(TaskRunner pid=37599)^[[0m             'ppo_micro_batch_size_per_gpu': None,
^[[36m(TaskRunner pid=37599)^[[0m             'ppo_mini_batch_size': 256,
^[[36m(TaskRunner pid=37599)^[[0m             'rollout_n': 5,
^[[36m(TaskRunner pid=37599)^[[0m             'shuffle': False,
^[[36m(TaskRunner pid=37599)^[[0m             'strategy': 'fsdp2',
^[[36m(TaskRunner pid=37599)^[[0m             'ulysses_sequence_parallel_size': 1,
^[[36m(TaskRunner pid=37599)^[[0m             'use_dynamic_bsz': False},
^[[36m(TaskRunner pid=37599)^[[0m  'custom_reward_function': {'name': 'compute_score', 'path': None},
^[[36m(TaskRunner pid=37599)^[[0m  'data': {'custom_cls': {'name': None, 'path': None},
^[[36m(TaskRunner pid=37599)^[[0m           'filter_overlong_prompts': True,
^[[36m(TaskRunner pid=37599)^[[0m           'filter_overlong_prompts_workers': 1,
^[[36m(TaskRunner pid=37599)^[[0m           'image_key': 'images',
^[[36m(TaskRunner pid=37599)^[[0m           'max_prompt_length': 1024,
^[[36m(TaskRunner pid=37599)^[[0m           'max_response_length': 1024,
^[[36m(TaskRunner pid=37599)^[[0m           'prompt_key': 'prompt',
^[[36m(TaskRunner pid=37599)^[[0m           'return_raw_chat': False,
^[[36m(TaskRunner pid=37599)^[[0m           'return_raw_input_ids': False,
^[[36m(TaskRunner pid=37599)^[[0m           'reward_fn_key': 'data_source',
^[[36m(TaskRunner pid=37599)^[[0m           'shuffle': True,
^[[36m(TaskRunner pid=37599)^[[0m           'tokenizer': None,
^[[36m(TaskRunner pid=37599)^[[0m           'train_batch_size': 1024,
^[[36m(TaskRunner pid=37599)^[[0m           'train_files': '/workspace/verl/data/train.parquet',
^[[36m(TaskRunner pid=37599)^[[0m           'truncation': 'error',
^[[36m(TaskRunner pid=37599)^[[0m           'val_batch_size': None,
^[[36m(TaskRunner pid=37599)^[[0m           'val_files': '/workspace/verl/data/test.parquet',
^[[36m(TaskRunner pid=37599)^[[0m           'video_key': 'videos'},
^[[36m(TaskRunner pid=37599)^[[0m  'ray_init': {'num_cpus': None},
^[[36m(TaskRunner pid=37599)^[[0m  'reward_model': {'enable': False,
^[[36m(TaskRunner pid=37599)^[[0m                   'forward_max_token_len_per_gpu': 32768,
^[[36m(TaskRunner pid=37599)^[[0m                   'launch_reward_fn_async': False,
^[[36m(TaskRunner pid=37599)^[[0m                   'max_length': None,
^[[36m(TaskRunner pid=37599)^[[0m                   'micro_batch_size': None,
^[[36m(TaskRunner pid=37599)^[[0m                   'micro_batch_size_per_gpu': None,
^[[36m(TaskRunner pid=37599)^[[0m                   'model': {'external_lib': None,
^[[36m(TaskRunner pid=37599)^[[0m                             'fsdp_config': {'fsdp_size': -1,
^[[36m(TaskRunner pid=37599)^[[0m                                             'param_offload': False,
^[[36m(TaskRunner pid=37599)^[[0m                                             'reshard_after_forward': True,
^[[36m(TaskRunner pid=37599)^[[0m                                             'wrap_policy': {'min_num_params': 0}},
^[[36m(TaskRunner pid=37599)^[[0m                             'input_tokenizer': 'Qwen/Qwen2.5-0.5B-Instruct',
^[[36m(TaskRunner pid=37599)^[[0m                             'path': '~/models/FsfairX-LLaMA3-RM-v0.1',
^[[36m(TaskRunner pid=37599)^[[0m                             'use_remove_padding': False},
^[[36m(TaskRunner pid=37599)^[[0m                   'reward_manager': 'naive',
^[[36m(TaskRunner pid=37599)^[[0m                   'strategy': 'fsdp2',
^[[36m(TaskRunner pid=37599)^[[0m                   'ulysses_sequence_parallel_size': 1,
^[[36m(TaskRunner pid=37599)^[[0m                   'use_dynamic_bsz': False},
^[[36m(TaskRunner pid=37599)^[[0m  'trainer': {'balance_batch': True,
^[[36m(TaskRunner pid=37599)^[[0m              'critic_warmup': 0,
^[[36m(TaskRunner pid=37599)^[[0m              'default_hdfs_dir': None,
^[[36m(TaskRunner pid=37599)^[[0m              'default_local_dir': 'checkpoints/verl_grpo_example_gsm8k/sgl_2_grpo_GSM8k_qwen0.5_test',
^[[36m(TaskRunner pid=37599)^[[0m              'del_local_ckpt_after_load': False,
^[[36m(TaskRunner pid=37599)^[[0m              'experiment_name': 'sgl_2_grpo_GSM8k_qwen0.5_test',
^[[36m(TaskRunner pid=37599)^[[0m              'log_val_generations': 0,
^[[36m(TaskRunner pid=37599)^[[0m              'logger': ['console', 'wandb'],
^[[36m(TaskRunner pid=37599)^[[0m              'max_actor_ckpt_to_keep': None,
^[[36m(TaskRunner pid=37599)^[[0m              'max_critic_ckpt_to_keep': None,
^[[36m(TaskRunner pid=37599)^[[0m              'n_gpus_per_node': 4,
^[[36m(TaskRunner pid=37599)^[[0m              'nnodes': 1,
^[[36m(TaskRunner pid=37599)^[[0m              'project_name': 'verl_grpo_example_gsm8k',
^[[36m(TaskRunner pid=37599)^[[0m              'ray_wait_register_center_timeout': 300,
^[[36m(TaskRunner pid=37599)^[[0m              'resume_from_path': None,
^[[36m(TaskRunner pid=37599)^[[0m              'resume_mode': 'auto',
^[[36m(TaskRunner pid=37599)^[[0m              'rollout_data_dir': None,
^[[36m(TaskRunner pid=37599)^[[0m              'save_freq': -1,
^[[36m(TaskRunner pid=37599)^[[0m              'test_freq': 5,
^[[36m(TaskRunner pid=37599)^[[0m              'total_epochs': 15,
^[[36m(TaskRunner pid=37599)^[[0m              'total_training_steps': None,
^[[36m(TaskRunner pid=37599)^[[0m              'val_before_train': True,
^[[36m(TaskRunner pid=37599)^[[0m              'validation_data_dir': None}}
^[[36m(TaskRunner pid=37599)^[[0m Using dataset class: RLHFDataset
^[[36m(TaskRunner pid=37599)^[[0m dataset len: 7473
^[[36m(TaskRunner pid=37599)^[[0m filter dataset len: 7473
^[[36m(TaskRunner pid=37599)^[[0m Using dataset class: RLHFDataset
^[[36m(TaskRunner pid=37599)^[[0m dataset len: 1319
^[[36m(TaskRunner pid=37599)^[[0m filter dataset len: 1319
^[[36m(TaskRunner pid=37599)^[[0m [validate_config] All configuration checks passed successfully!
^[[36m(TaskRunner pid=37599)^[[0m Size of train dataloader: 7, Size of val dataloader: 1
^[[36m(TaskRunner pid=37599)^[[0m Total training steps: 105
^[[36m(TaskRunner pid=37599)^[[0m colocated worker base class <class 'verl.single_controller.base.worker.Worker'>
^[[36m(WorkerDict pid=39180)^[[0m Monkey patch _flash_attention_forward in transformers.integrations.flash_attention
^[[36m(WorkerDict pid=38869)^[[0m Model config after override: Qwen2Config {
^[[36m(WorkerDict pid=38869)^[[0m   "architectures": [
^[[36m(WorkerDict pid=38869)^[[0m     "Qwen2ForCausalLM"
^[[36m(WorkerDict pid=38869)^[[0m   ],
^[[36m(WorkerDict pid=38869)^[[0m   "attention_dropout": 0.0,
^[[36m(WorkerDict pid=38869)^[[0m   "eos_token_id": 151645,
^[[36m(WorkerDict pid=38869)^[[0m   "hidden_act": "silu",
^[[36m(WorkerDict pid=38869)^[[0m   "hidden_size": 896,
^[[36m(WorkerDict pid=38869)^[[0m   "initializer_range": 0.02,
^[[36m(WorkerDict pid=38869)^[[0m   "intermediate_size": 4864,
^[[36m(WorkerDict pid=38869)^[[0m   "max_position_embeddings": 32768,
^[[36m(WorkerDict pid=38869)^[[0m   "max_window_layers": 21,
^[[36m(WorkerDict pid=38869)^[[0m   "model_type": "qwen2",
^[[36m(WorkerDict pid=38869)^[[0m   "num_attention_heads": 14,
^[[36m(WorkerDict pid=38869)^[[0m   "num_hidden_layers": 24,
^[[36m(WorkerDict pid=38869)^[[0m   "num_key_value_heads": 2,
^[[36m(WorkerDict pid=38869)^[[0m   "pad_token_id": 151643,
^[[36m(WorkerDict pid=38869)^[[0m   "rms_norm_eps": 1e-06,
^[[36m(WorkerDict pid=38869)^[[0m   "rope_scaling": null,
^[[36m(WorkerDict pid=38869)^[[0m   "rope_theta": 1000000.0,
^[[36m(WorkerDict pid=38869)^[[0m   "sliding_window": 32768,
^[[36m(WorkerDict pid=38869)^[[0m   "tie_word_embeddings": true,
^[[36m(WorkerDict pid=38869)^[[0m   "torch_dtype": "bfloat16",
^[[36m(WorkerDict pid=38869)^[[0m   "transformers_version": "4.51.1",
^[[36m(WorkerDict pid=38869)^[[0m   "use_cache": true,
^[[36m(WorkerDict pid=38869)^[[0m   "use_sliding_window": false,
^[[36m(WorkerDict pid=38869)^[[0m   "vocab_size": 151936
^[[36m(WorkerDict pid=38869)^[[0m }
^[[36m(WorkerDict pid=38869)^[[0m
^[[36m(WorkerDict pid=38869)^[[0m Qwen2ForCausalLM contains 494.03M parameters
^[[36m(WorkerDict pid=38869)^[[0m wrap_policy: functools.partial(<function _or_policy at 0x403073bd3920>, policies=[functools.partial(<function transformer_auto_wrap_policy at 0x403073bd37e0>, transformer_layer_cls={<class 'transformers.models.qwen2.modeling_qwen2.Qwen2DecoderLayer'>})])
^[[36m(WorkerDict pid=38869)^[[0m Monkey patch _flash_attention_forward in transformers.integrations.flash_attention^[[32m [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)^[[0m
^[[36m(WorkerDict pid=38869)^[[0m Actor use_remove_padding=True
^[[36m(WorkerDict pid=38869)^[[0m Model config after override: Qwen2Config {
^[[36m(WorkerDict pid=38869)^[[0m   "architectures": [
^[[36m(WorkerDict pid=38869)^[[0m     "Qwen2ForCausalLM"
^[[36m(WorkerDict pid=38869)^[[0m   ],
^[[36m(WorkerDict pid=38869)^[[0m   "attention_dropout": 0.0,
^[[36m(WorkerDict pid=38869)^[[0m   "eos_token_id": 151645,
^[[36m(WorkerDict pid=38869)^[[0m   "hidden_act": "silu",
^[[36m(WorkerDict pid=38869)^[[0m   "hidden_size": 896,
^[[36m(WorkerDict pid=38869)^[[0m   "initializer_range": 0.02,
^[[36m(WorkerDict pid=38869)^[[0m   "intermediate_size": 4864,
^[[36m(WorkerDict pid=38869)^[[0m   "max_position_embeddings": 32768,
^[[36m(WorkerDict pid=38869)^[[0m   "max_window_layers": 21,
^[[36m(WorkerDict pid=38869)^[[0m   "model_type": "qwen2",
^[[36m(WorkerDict pid=38869)^[[0m   "num_attention_heads": 14,
^[[36m(WorkerDict pid=38869)^[[0m   "num_hidden_layers": 24,
^[[36m(WorkerDict pid=38869)^[[0m   "num_key_value_heads": 2,
^[[36m(WorkerDict pid=38869)^[[0m   "pad_token_id": 151643,
^[[36m(WorkerDict pid=38869)^[[0m   "rms_norm_eps": 1e-06,
^[[36m(WorkerDict pid=38869)^[[0m   "rope_scaling": null,
^[[36m(WorkerDict pid=38869)^[[0m   "rope_theta": 1000000.0,
^[[36m(WorkerDict pid=38869)^[[0m   "sliding_window": 32768,
^[[36m(WorkerDict pid=38869)^[[0m   "tie_word_embeddings": true,
^[[36m(WorkerDict pid=38869)^[[0m   "torch_dtype": "bfloat16",
^[[36m(WorkerDict pid=38869)^[[0m   "transformers_version": "4.51.1",
^[[36m(WorkerDict pid=38869)^[[0m   "use_cache": true,
^[[36m(WorkerDict pid=38869)^[[0m   "use_sliding_window": false,
^[[36m(WorkerDict pid=38869)^[[0m   "vocab_size": 151936
^[[36m(WorkerDict pid=38869)^[[0m }
^[[36m(WorkerDict pid=38869)^[[0m
^[[36m(WorkerDict pid=38869)^[[0m Qwen2ForCausalLM contains 494.03M parameters
^[[36m(WorkerDict pid=38869)^[[0m wrap_policy: functools.partial(<function _or_policy at 0x403073bd3920>, policies=[functools.partial(<function transformer_auto_wrap_policy at 0x403073bd37e0>, transformer_layer_cls={<class 'transformers.models.qwen2.modeling_qwen2.Qwen2DecoderLayer'>})])^[[32m [repeated 4x across cluster]^[[0m
^[[36m(WorkerDict pid=38869)^[[0m Monkey patch _flash_attention_forward in transformers.integrations.flash_attention^[[32m [repeated 4x across cluster]^[[0m
^[[36m(WorkerDict pid=38869)^[[0m Total steps: 105, num_warmup_steps: 0
^[[36m(WorkerDict pid=39179)^[[0m kwargs: {'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
^[[36m(WorkerDict pid=39179)^[[0m Actor use_remove_padding=True^[[32m [repeated 7x across cluster]^[[0m
^[[36m(WorkerDict pid=39179)^[[0m wrap_policy: functools.partial(<function _or_policy at 0x403065897920>, policies=[functools.partial(<function transformer_auto_wrap_policy at 0x4030658977e0>, transformer_layer_cls={<class 'transformers.models.qwen2.modeling_qwen2.Qwen2DecoderLayer'>})])^[[32m [repeated 3x across cluster]^[[0m
^[[36m(WorkerDict pid=39179)^[[0m Total steps: 105, num_warmup_steps: 0^[[32m [repeated 3x across cluster]^[[0m
^[[36m(WorkerDict pid=39181)^[[0m kwargs: {'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}^[[32m [repeated 2x across cluster]^[[0m
^[[36m(TaskRunner pid=37599)^[[0m Using LocalLogger is deprecated. The constructor API will change
^[[36m(TaskRunner pid=37599)^[[0m Checkpoint tracker file does not exist: %s /workspace/verl/checkpoints/verl_grpo_example_gsm8k/sgl_2_grpo_GSM8k_qwen0.5_test/latest_checkpointed_iteration.txt
^[[36m(TaskRunner pid=37599)^[[0m Training from scratch
^[[36m(TaskRunner pid=37599)^[[0m test_gen_batch meta info: {'eos_token_id': 151645, 'pad_token_id': 151643, 'recompute_log_prob': False, 'do_sample': False, 'validate': True}
^[[36m(WorkerDict pid=38869)^[[0m self.sampling_params={'n': 1, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
^[[36m(WorkerDict pid=39180)^[[0m kwargs: {'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
^[[36m(TaskRunner pid=37599)^[[0m validation generation end
^[[36m(WorkerDict pid=39180)^[[0m self.sampling_params={'n': 1, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}^[[32m [repeated 3x across cluster]^[[0m
^[[36m(TaskRunner pid=37599)^[[0m [prompt] system
^[[36m(TaskRunner pid=37599)^[[0m You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
^[[36m(TaskRunner pid=37599)^[[0m user
^[[36m(TaskRunner pid=37599)^[[0m Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer after "####".
^[[36m(TaskRunner pid=37599)^[[0m assistant
^[[36m(TaskRunner pid=37599)^[[0m
^[[36m(TaskRunner pid=37599)^[[0m [response] To determine how much Janet makes at the farmers' market every day, we need to follow these steps:
^[[36m(TaskRunner pid=37599)^[[0m
^[[36m(TaskRunner pid=37599)^[[0m 1. **Calculate the total number of eggs laid by the ducks in a day:**
^[[36m(TaskRunner pid=37599)^[[0m    - Janet's ducks lay 16 eggs per day.
^[[36m(TaskRunner pid=37599)^[[0m
^[[36m(TaskRunner pid=37599)^[[0m 2. **Calculate the total number of eggs Janet eats in a day:**
^[[36m(TaskRunner pid=37599)^[[0m    - Janet eats 3 eggs for breakfast.
^[[36m(TaskRunner pid=37599)^[[0m    - She eats 4 muffins for baking.
^[[36m(TaskRunner pid=37599)^[[0m    - Therefore, the total number of eggs she eats in a day is:
^[[36m(TaskRunner pid=37599)^[[0m      \[
^[[36m(TaskRunner pid=37599)^[[0m      3 \text{ (breakfast)} + 4 \text{ (baking)} = 7 \text{ eggs}
^[[36m(TaskRunner pid=37599)^[[0m      \]
^[[36m(TaskRunner pid=37599)^[[0m
^[[36m(TaskRunner pid=37599)^[[0m 3. **Calculate the number of eggs Janet sells at the farmers' market in a day:**
^[[36m(TaskRunner pid=37599)^[[0m    - She sells the remainder of the eggs at the farmers' market.
^[[36m(TaskRunner pid=37599)^[[0m    - The total number of eggs laid in a day is 16.
^[[36m(TaskRunner pid=37599)^[[0m    - Subtract the number of eggs she eats from the total:
^[[36m(TaskRunner pid=37599)^[[0m      \[
^[[36m(TaskRunner pid=37599)^[[0m      16 \text{ (total eggs)} - 7 \text{ (eggs eaten)} = 9 \text{ eggs}
^[[36m(TaskRunner pid=37599)^[[0m      \]
^[[36m(TaskRunner pid=37599)^[[0m
^[[36m(TaskRunner pid=37599)^[[0m 4. **Calculate the total revenue from selling the eggs at the farmers' market:**
^[[36m(TaskRunner pid=37599)^[[0m    - Each egg is sold for $2.
^[[36m(TaskRunner pid=37599)^[[0m    - The number of eggs sold is 9.
^[[36m(TaskRunner pid=37599)^[[0m    - Therefore, the total revenue is:
^[[36m(TaskRunner pid=37599)^[[0m      \[
^[[36m(TaskRunner pid=37599)^[[0m      9 \text{ eggs} \times 2 \text{ dollars/egg} = 18 \text{ dollars}
^[[36m(TaskRunner pid=37599)^[[0m      \]
^[[36m(TaskRunner pid=37599)^[[0m
^[[36m(TaskRunner pid=37599)^[[0m Thus, Janet makes \(\boxed{18}\) dollars every day at the farmers' market.
^[[36m(TaskRunner pid=37599)^[[0m [ground_truth] 18
^[[36m(TaskRunner pid=37599)^[[0m [score] 0.0
^[[36m(TaskRunner pid=37599)^[[0m ("Initial validation metrics: {'val-core/openai/gsm8k/reward/mean@1': "
^[[36m(TaskRunner pid=37599)^[[0m  '0.000758150113722517}')
^[[36m(TaskRunner pid=37599)^[[0m step:0 - val-core/openai/gsm8k/reward/mean@1:0.001
^[[36m(TaskRunner pid=37599)^[[0m list(reward_extra_infos_dict.keys())=[]
^[[36m(WorkerDict pid=39179)^[[0m self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}^[[32m [repeated 4x across cluster]^[[0m
^[[36m(TaskRunner pid=37599)^[[0m step:1 - global_seqlen/min:544805.000 - global_seqlen/max:549651.000 - global_seqlen/minmax_diff:4846.000 - global_seqlen/balanced_min:547491.000 - global_seqlen/balanced_max:547491.000 - global_seqlen/mean:547491.000 - actor/entropy_loss:0.559 - actor/kl_loss:0.000 - actor/kl_coef:0.001 - actor/pg_loss:-0.002 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:0.063 - perf/mfu/actor:0.935 - perf/max_memory_allocated_gb:24.369 - perf/max_memory_reserved_gb:65.512 - perf/cpu_memory_used_gb:354.743 - actor/lr:0.000 - training/global_step:1.000 - training/epoch:0.000 - critic/score/mean:0.012 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.012 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.004 - critic/advantages/max:1.789 - critic/advantages/min:-0.730 - critic/returns/mean:-0.004 - critic/returns/max:1.789 - critic/returns/min:-0.730 - response_length/mean:323.300 - response_length/max:1024.000 - response_length/min:2.000 - response_length/clip_ratio:0.005 - prompt_length/mean:104.428 - prompt_length/max:215.000 - prompt_length/min:65.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:73.388 - timing_s/reward:0.994 - timing_s/old_log_prob:6.349 - timing_s/ref:3.266 - timing_s/adv:0.089 - timing_s/update_actor:15.565 - timing_s/step:99.733 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.001 - timing_per_token_ms/update_actor:0.007 - timing_per_token_ms/gen:0.044 - perf/total_num_tokens:2189964.000 - perf/time_per_step:99.733 - perf/throughput:5489.549
^[[36m(WorkerDict pid=38869)^[[0m self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
^[[36m(TaskRunner pid=37599)^[[0m list(reward_extra_infos_dict.keys())=[]
^[[36m(WorkerDict pid=39179)^[[0m self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}^[[32m [repeated 3x across cluster]^[[0m
^[[36m(TaskRunner pid=37599)^[[0m step:2 - global_seqlen/min:543246.000 - global_seqlen/max:557058.000 - global_seqlen/minmax_diff:13812.000 - global_seqlen/balanced_min:550333.000 - global_seqlen/balanced_max:550334.000 - global_seqlen/mean:550333.750 - actor/entropy_loss:0.578 - actor/kl_loss:0.001 - actor/kl_coef:0.001 - actor/pg_loss:0.005 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:0.043 - perf/mfu/actor:1.047 - perf/max_memory_allocated_gb:26.904 - perf/max_memory_reserved_gb:65.512 - perf/cpu_memory_used_gb:331.719 - actor/lr:0.000 - training/global_step:2.000 - training/epoch:0.000 - critic/score/mean:0.011 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.011 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.001 - critic/advantages/max:1.789 - critic/advantages/min:-0.730 - critic/returns/mean:-0.001 - critic/returns/max:1.789 - critic/returns/min:-0.730 - response_length/mean:327.414 - response_length/max:1024.000 - response_length/min:3.000 - response_length/clip_ratio:0.004 - prompt_length/mean:102.534 - prompt_length/max:256.000 - prompt_length/min:63.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:73.077 - timing_s/reward:1.019 - timing_s/old_log_prob:3.699 - timing_s/ref:3.076 - timing_s/adv:0.086 - timing_s/update_actor:13.993 - timing_s/step:95.031 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.001 - timing_per_token_ms/update_actor:0.006 - timing_per_token_ms/gen:0.044 - perf/total_num_tokens:2201335.000 - perf/time_per_step:95.031 - perf/throughput:5791.128
^[[36m(WorkerDict pid=38869)^[[0m self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
^[[36m(TaskRunner pid=37599)^[[0m list(reward_extra_infos_dict.keys())=[]
^[[36m(WorkerDict pid=39179)^[[0m self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}^[[32m [repeated 3x across cluster]^[[0m
^[[36m(TaskRunner pid=37599)^[[0m step:3 - global_seqlen/min:530230.000 - global_seqlen/max:546595.000 - global_seqlen/minmax_diff:16365.000 - global_seqlen/balanced_min:540558.000 - global_seqlen/balanced_max:540558.000 - global_seqlen/mean:540558.000 - actor/entropy_loss:0.556 - actor/kl_loss:0.003 - actor/kl_coef:0.001 - actor/pg_loss:0.010 - actor/pg_clipfrac:0.000 - actor/ppo_kl:0.001 - actor/pg_clipfrac_lower:0.000 - actor/grad_norm:0.072 - perf/mfu/actor:1.043 - perf/max_memory_allocated_gb:26.904 - perf/max_memory_reserved_gb:65.512 - perf/cpu_memory_used_gb:331.025 - actor/lr:0.000 - training/global_step:3.000 - training/epoch:0.000 - critic/score/mean:0.021 - critic/score/max:1.000 - critic/score/min:0.000 - critic/rewards/mean:0.021 - critic/rewards/max:1.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.005 - critic/advantages/max:1.789 - critic/advantages/min:-0.730 - critic/returns/mean:-0.005 - critic/returns/max:1.789 - critic/returns/min:-0.730 - response_length/mean:318.616 - response_length/max:1024.000 - response_length/min:3.000 - response_length/clip_ratio:0.005 - prompt_length/mean:103.695 - prompt_length/max:201.000 - prompt_length/min:68.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:72.185 - timing_s/reward:0.989 - timing_s/old_log_prob:3.666 - timing_s/ref:3.055 - timing_s/adv:0.087 - timing_s/update_actor:13.782 - timing_s/step:93.844 - timing_per_token_ms/adv:0.000 - timing_per_token_ms/ref:0.001 - timing_per_token_ms/update_actor:0.006 - timing_per_token_ms/gen:0.044 - perf/total_num_tokens:2162232.000 - perf/time_per_step:93.844 - perf/throughput:5760.155
^[[36m(WorkerDict pid=38869)^[[0m self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
^[[36m(WorkerDict pid=39180)^[[0m
^[[36m(WorkerDict pid=39180)^[[0m nid006560:39180:39713 [0] nccl_net_ofi_rdma_init:7978 NCCL WARN NET/OFI OFI fi_getinfo() call failed: No data available
^[[36m(WorkerDict pid=39180)^[[0m NCCL version 2.25.1+cuda12.8
^[[36m(WorkerDict pid=39179)^[[0m self.sampling_params={'n': 5, 'max_new_tokens': 1024, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'repetition_penalty': 1.0, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}^[[32m [repeated 3x across cluster]^[[0m
^[[36m(WorkerDict pid=39179)^[[0m
^[[33m(raylet)^[[0m A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffae15d88249362b1f0287b91101000000 Worker ID: 562ae018225fb54e48662527647429d0b31b3d7468c7ea27c250ea93 Node ID: ab9670e90eb16492b726ba8a0999a0f200bf92170ae8ee276d369ef1 Worker IP address: 172.28.31.248 Worker port: 35381 Worker PID: 38869 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_files=/workspace/verl/data/train.parquet', 'data.val_files=/workspace/verl/data/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=1024', 'data.max_response_length=1024', 'data.filter_overlong_prompts=True', 'data.truncation=error', 'actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=40', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=40', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=sglang', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.6', 'actor_rollout_ref.rollout.n=5', 'actor_rollout_ref.rollout.enforce_eager=False', 'actor_rollout_ref.rollout.free_cache_engine=False', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=40', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'actor_rollout_ref.ref.strategy=fsdp2', 'actor_rollout_ref.actor.strategy=fsdp2', 'critic.strategy=fsdp2', 'reward_model.strategy=fsdp2', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console,wandb]', 'trainer.project_name=verl_grpo_example_gsm8k', 'trainer.experiment_name=sgl_2_grpo_GSM8k_qwen0.5_test', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=5', 'trainer.total_epochs=15']
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/verl/verl/trainer/main_ppo.py", line 246, in <module>
    main()
  File "/usr/local/lib/python3.12/dist-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/usr/local/lib/python3.12/dist-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/verl/verl/trainer/main_ppo.py", line 64, in main
    run_ppo(config)
  File "/workspace/verl/verl/trainer/main_ppo.py", line 76, in run_ppo
    ray.get(runner.run.remote(config))
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2822, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 930, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ^[[36mray::TaskRunner.run()^[[39m (pid=37599, ip=172.28.31.248, actor_id=9def2e14010cfa63ade4b3f001000000, repr=<main_ppo.TaskRunner object at 0x400032c09eb0>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/verl/verl/trainer/main_ppo.py", line 183, in run
    trainer.fit()
  File "/workspace/verl/verl/trainer/ppo/ray_trainer.py", line 911, in fit
    gen_batch_output = self.actor_rollout_wg.generate_sequences(gen_batch)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/verl/verl/single_controller/ray/base.py", line 49, in func
    output = ray.get(output)
             ^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
        class_name: create_colocated_worker_cls.<locals>.WorkerDict
        actor_id: ae15d88249362b1f0287b91101000000
        pid: 38869
        name: ItrCVfWorkerDict_0:0
        namespace: 7286510c-580a-4507-b3fd-c6b7821e4d0b
        ip: 172.28.31.248
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
^[[36m(WorkerDict pid=39179)^[[0m [rank1]:[W518 14:51:06.739468254 TCPStore.cpp:125] [c10d] recvValue failed on SocketImpl(fd=355, addr=[nid006560]:57648, remote=[nid006560]:42823): failed to recv, got 0 bytes
^[[36m(WorkerDict pid=39179)^[[0m Exception raised from recvBytes at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/Utils.hpp:671 (most recent call first):^[[32m [repeated 2x across cluster]^[[0m
^[[36m(WorkerDict pid=39179)^[[0m frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xd4 (0x402fbea2a9a4 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)^[[32m [repeated 2x across cluster]^[[0m
^[[36m(WorkerDict pid=39179)^[[0m frame #8: <unknown function> + 0xeba4c (0x40000fe1ba4c in /usr/lib/aarch64-linux-gnu/libc.so.6)^[[32m [repeated 12x across cluster]^[[0m
^[[36m(WorkerDict pid=39179)^[[0m frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x1cc (0x402f742f6e4c in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cpu.so)^[[32m [repeated 2x across cluster]^[[0m
^[[36m(WorkerDict pid=39179)^[[0m frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x280 (0x402f772ef670 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)^[[32m [repeated 2x across cluster]^[[0m
^[[36m(WorkerDict pid=39179)^[[0m [rank1]:[W518 14:51:06.742013584 ProcessGroupNCCL.cpp:1671] [PG ID 0 PG GUID 0(default_pg) Rank 1] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: failed to recv, got 0 bytes

EduardDurech avatar May 18 '25 13:05 EduardDurech

Image

@EduardDurech Hi Eduard, thanks for your detailed profiling. I've done some profiling on our side by running Qwen 7B GRPO with almost the same setup as verl's recipe.

I do notice that SGLang is slower than vLLM on H200, especially within a single node, which is the opposite of what we see when benchmarking SGLang vs. vLLM directly with the inference engine benchmarks.

I'll check with @zhaochenyang20 and @ocss884 to see if we can find the bottleneck and speed up

hebiao064 avatar May 18 '25 23:05 hebiao064

I do notice that SGLang is slower than vLLM on H200, especially within a single node, which is the opposite of what we see when benchmarking SGLang vs. vLLM directly with the inference engine benchmarks.

I'll check with @zhaochenyang20 and @ocss884 to see if we can find the bottleneck and speed up

Sounds good, that matches what I see: the opposite of standalone vLLM vs. SGLang.

We can also profile, but we're having various issues getting Prometheus/Grafana to properly trace veRL. Maybe you have some tips, or know someone with experience setting it up for veRL?

EduardDurech avatar May 19 '25 09:05 EduardDurech

Thanks, let me raise this in the verl community. @EduardDurech

zhaochenyang20 avatar May 19 '25 17:05 zhaochenyang20

is this issue resolved?

eric-haibin-lin avatar May 26 '25 03:05 eric-haibin-lin

Still profiling it. @hebiao064 Are there any updates 😭

zhaochenyang20 avatar May 26 '25 05:05 zhaochenyang20

Hey guys, any updates?

EduardDurech avatar Jun 03 '25 23:06 EduardDurech

@EduardDurech Hi Eduard, tbh I don't have an insightful update, but just to share what I did:

Image Image

Some interesting findings:

  • In CUDA 12.6, SGLang and vLLM are on par
  • In CUDA 12.8, SGLang drops a lot when switching to TP2; there is some speculation about whether it's due to NCCL versions (a quick way to check the CUDA/NCCL builds in each environment is sketched below)
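
For reference, here is a minimal sketch (not from the original runs) that prints which CUDA and NCCL builds a process actually loads, using standard torch introspection; running it once standalone and once inside a verl Ray worker makes version/env mismatches easy to spot:

```python
# Minimal sketch: dump the CUDA/NCCL builds and any NCCL env overrides a process sees,
# so a standalone SGLang run and a verl-launched worker can be compared side by side.
import os

import torch

print("torch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)
print("NCCL (torch build):", ".".join(map(str, torch.cuda.nccl.version())))

# NCCL env overrides (e.g. NCCL_P2P_DISABLE, NCCL_ALGO) can also differ between the
# two environments; print anything the launcher or Ray may have injected.
for key, value in sorted(os.environ.items()):
    if key.startswith("NCCL_"):
        print(key, "=", value)
```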

Besides, I've tried a few things:

  • Enlarged the max CUDA graph batch size for SGLang (the capture interval also needs to be enlarged), since the rollout batch size can go up to 1280 but the max capture size is 256; however, I didn't see much speedup
  • I ran some profiling and (of course) found the all-reduce is kinda slow, so if your model fits on one GPU, please keep using TP 1 for now. Profiling verl + SGLang is also not straightforward: I had to modify some code in the scheduler and verl_engine, and simply adding the torch profiler doesn't work since verl's Ray PPO process is not the same process as SGLang's TP worker (see the profiler sketch after the image below).
Image
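
To illustrate the last point, a minimal sketch of attaching `torch.profiler` around the rollout call in the driver-side worker (the `rollout.generate_sequences` call is a stand-in, not verl's exact API): because SGLang's TP workers run in separate processes per the observation above, such a trace mostly shows the caller waiting on them, and the TP-worker kernels need their own profiler or Nsight attached per process.

```python
# Minimal sketch, not verl's built-in profiling: wrap the rollout call in torch.profiler.
# The resulting trace only covers this process; kernels inside SGLang's TP worker
# processes will not appear in it.
import torch
from torch.profiler import ProfilerActivity, profile

def profile_rollout(rollout, batch, trace_path="rollout_trace.json"):
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        out = rollout.generate_sequences(batch)  # stand-in for the verl rollout call
        torch.cuda.synchronize()
    prof.export_chrome_trace(trace_path)  # open in chrome://tracing or Perfetto
    return out
```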

hebiao064 avatar Jun 04 '25 06:06 hebiao064

@hebiao064 great, thanks. We're training 8B and 70B models, so unfortunately we need TP=2 with CUDA 12.8

  1. It is weird, though, that gen is slower in veRL than standalone, no? SGLang gives roughly twice the throughput for me in normal inference (see the standalone sketch after this list): https://github.com/volcengine/verl/issues/1208#issuecomment-2888733186
  2. Yea, this was something we noticed 😃 Have you had the same experience with Nsight?
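
By "normal inference" I mean plain offline generation, roughly like the sketch below (SGLang's offline Engine API with the same model, TP, memory fraction, and sampling settings as the verl config above; the prompts and exact kwargs here are placeholders, not the script we actually ran):

```python
# Rough sketch of a standalone SGLang throughput check (placeholder prompts/kwargs).
import time

import sglang as sgl

if __name__ == "__main__":
    llm = sgl.Engine(
        model_path="Qwen/Qwen2.5-0.5B-Instruct",
        tp_size=2,
        mem_fraction_static=0.6,
    )
    prompts = ["Solve step by step: 48 + 24 * 2 = ?"] * 1024  # placeholder GSM8K-style prompts
    sampling_params = {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 1024}

    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start

    gen_tokens = sum(o["meta_info"]["completion_tokens"] for o in outputs)
    print(f"{gen_tokens / elapsed:.1f} generated tokens/s over {elapsed:.1f}s")
    llm.shutdown()
```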

EduardDurech avatar Jun 04 '25 18:06 EduardDurech

  1. Yes, it's very weird. I compared with vLLM using DP + TP as well, and SGLang is better or on par. I am not sure why veRL's multiple SGLang processes (mimicking DP) + TP 2 end up slower; I doubt it's due to the logic of how verl dispatches data and orchestrates SGLang.
  2. I haven't tried Nsight; profiling is pretty hard as well, given we can only profile a few steps.

hebiao064 avatar Jun 04 '25 23:06 hebiao064

Is this issue specific to GH200? Do you experience similar slowdown with other hardware types? cc @davidmlw

eric-haibin-lin avatar Jul 15 '25 16:07 eric-haibin-lin

Our cluster is homogeneous, so we have nothing else to test on, but @hebiao064 reproduced the slowdown; maybe you guys have other hardware to test on?

EduardDurech avatar Jul 15 '25 17:07 EduardDurech

@ocss884 @hebiao064 can we contact you offline? This is quite critical, and we have quite a large project: https://news.ycombinator.com/item?id=44535637

EduardDurech avatar Aug 05 '25 13:08 EduardDurech

Image

@EduardDurech Hi Eduard, thanks for your detailed profiling. I've done some profiling on our side by running Qwen 7B GRPO with almost the same setup as verl's recipe.

I do notice that SGLang is slower than vLLM on H200, especially within a single node, which is the opposite of what we see when benchmarking SGLang vs. vLLM directly with the inference engine benchmarks.

I'll check with @zhaochenyang20 and @ocss884 to see if we can find the bottleneck and speed up

Hi, I wonder if the problem has been resolved. I also see verl + SGLang running much slower than verl + vLLM. Could you provide the two run scripts (+ configs) for your benchmark on H200? Thanks a lot in advance!

Cranial-XIX avatar Aug 08 '25 22:08 Cranial-XIX

The same problem appears in our tests: SGLang's cross_device_reduce_1stage op is much slower in verl than in the normal SGLang inference process (a minimal all-reduce microbenchmark sketch follows the image below).

Image
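
A minimal sketch to help isolate this, assuming a launch such as `torchrun --nproc_per_node=2 bench_allreduce.py` (hypothetical script name): it times plain NCCL all_reduce at a small, decode-sized message. This is only a proxy for SGLang's custom cross_device_reduce_1stage kernel, but comparing the number inside a verl-launched worker vs. a clean standalone process can show whether the environment (topology, CPU affinity, NCCL settings) is at fault rather than SGLang itself.

```python
# Minimal all-reduce microbenchmark sketch (assumes torchrun sets the rank env vars).
import os
import time

import torch
import torch.distributed as dist

def bench_allreduce(numel=128 * 1024, iters=200, warmup=20):
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    x = torch.randn(numel, dtype=torch.bfloat16, device="cuda")

    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if dist.get_rank() == 0:
        print(f"all_reduce({numel} bf16 elems): {elapsed / iters * 1e6:.1f} us/iter")
    dist.destroy_process_group()

if __name__ == "__main__":
    bench_allreduce()
```

If called from inside an already-initialized verl worker, skip the init/destroy calls since a default process group will already exist there.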

squall1988 avatar Sep 10 '25 06:09 squall1988


Do you mind sharing how to build VERL on GH200? I am struggling with the aarch64 architecture. Thank you.

siriluo avatar Sep 21 '25 22:09 siriluo

Do you mind sharing how to build VERL on GH200? I am struggling with the aarch64 architecture. Thank you.

You can email me so the thread stays on topic, but the main issue is building your inference engine: vLLM has aarch64 wheels and SGLang has a Blackwell image, which is the easiest route; otherwise I build most of my packages from source

EduardDurech avatar Sep 21 '25 22:09 EduardDurech