verl
(raylet) A worker died or was killed while executing a task by an unexpected system error.
The program ran normally for the first few attempts, but during one particular run it suddenly hit the following error, and the error has persisted on every run since.
```
(WorkerDict pid=318460) INFO 03-02 14:03:38 selector.py:115] Using XFormers backend.
(WorkerDict pid=319149) NCCL version 2.20.5+cuda12.4
(WorkerDict pid=319150) Total steps: 19200, num_warmup_steps: 0 [repeated 3x across cluster]
(WorkerDict pid=318460) before init cache memory allocated: 14.50004992GB, reserved: 14.71152128GB
(WorkerDict pid=318460) after init cache memory allocated: 41.516872704GB, reserved: 41.781559296GB
(WorkerDict pid=318460) kwargs: {'n': 1, 'logprobs': 1, 'max_tokens': 1024, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
(WorkerDict pid=318460) After building vllm rollout, memory allocated (GB): 32.68087673187256, memory reserved (GB): 38.912109375
(WorkerDict pid=318460) After building sharding manager, memory allocated (GB): 32.68087673187256, memory reserved (GB): 38.912109375
(WorkerDict pid=319150) WARNING 03-02 14:03:37 config.py:380] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used [repeated 3x across cluster]
(WorkerDict pid=319151) local rank 0 [repeated 3x across cluster]
(WorkerDict pid=319151) INFO 03-02 14:03:38 selector.py:115] Using XFormers backend. [repeated 7x across cluster]
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
  RayTask ID: f62e6c8941e9846f2613d2796b130d8de9bac89b01000000
  Worker ID: bdc1b3f207dc58986a1130ea8780e2fc4b50269373dcf11be8121bb2
  Node ID: c917744d524b83efa40873a8c07b861c45b99bf725b9b03635c1a076
  Worker IP address: 10.49.160.102 Worker port: 33863 Worker PID: 317470
  Worker exit type: SYSTEM_ERROR
  Worker exit detail: The leased worker has unrecoverable failure. Worker is requested to be destroyed when it is returned. Worker exits with an exit code None.
ray.exceptions.RayTaskError(RaySystemError): ray::main_task() (pid=318138, ip=10.49.160.102)
  File "/verl/trainer/main_ppo.py", line 188, in main_task
    trainer.init_workers()
  File "/verl/trainer/ppo/ray_trainer.py", line 498, in init_workers
    wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
  File "/verl/single_controller/ray/base.py", line 195, in __init__
    self._init_with_resource_pool(resource_pool=resource_pool,
  File "/verl/single_controller/ray/base.py", line 218, in _init_with_resource_pool
    pgs = resource_pool.get_placement_groups(strategy=strategy)
  File "/verl/single_controller/ray/base.py", line 78, in get_placement_groups
    pgs = [
  File "/verl/single_controller/ray/base.py", line 79, in <listcomp>
    placement_group(bundles=bundles, strategy=strategy, name=pg_name_prefix + str(idx), lifetime=lifetime)
  File "/usr/local/lib/python3.10/dist-packages/ray/util/placement_group.py", line 211, in placement_group
    placement_group_id = worker.core_worker.create_placement_group(
  File "python/ray/includes/common.pxi", line 108, in ray._raylet.check_status
ray.exceptions.RaySystemError: System error: Failed to create placement group '37f2cda36162ae36e11df8a30a0901000000' because name 'global_poolverl_group_4:0' already exists.
```
```bash
python3 -m verl.trainer.main_ppo \
    data.train_files=$DATA_DIR/train.parquet \
    data.val_files=$DATA_DIR/test.parquet \
    data.train_batch_size=256 \
    data.val_batch_size=1312 \
    data.max_prompt_length=256 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=Llama-3.2-3B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size=8 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size=8 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size=4 \
    critic.optim.lr=1e-5 \
    critic.model.path=Llama-3.2-3B-Instruct \
    critic.ppo_micro_batch_size=8 \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.logger=['wandb'] \
    +trainer.val_before_train=False \
    trainer.default_hdfs_dir=null \
    trainer.n_gpus_per_node=4 \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=100 \
    trainer.project_name=TinyZero \
    trainer.experiment_name=$EXPERIMENT_NAME \
    trainer.total_epochs=15 2>&1 | tee verl_demo.log
```
Did you monitor the system usage, such as CPU memory? Did it fail due to OOM?
@eric-haibin-lin I used nvitop for monitoring, and this is the system resource usage immediately before the 'worker died or was killed' error occurred.
The error seems to be:
```
ray.exceptions.RaySystemError: System error: Failed to create placement group '37f2cda36162ae36e11df8a30a0901000000' because name 'global_poolverl_group_4:0' already exists.
```
instead of OOM.
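The name collision suggests that a placement group registered by the earlier, crashed run is still tracked by the Ray cluster, so the new run cannot create `global_poolverl_group_4:0` again. The blunt fix is to run `ray stop` (or restart the Ray head) before relaunching, which clears all placement groups. Alternatively, here is a rough sketch of how one might list and remove the stale groups by hand; this uses Ray's generic placement-group utilities, not anything verl-specific, and the `global_pool` prefix is simply taken from the error message above:

```python
import ray
from ray.util import get_placement_group, placement_group_table, remove_placement_group

# Attach to the already-running Ray cluster (the one the failed job used)
# instead of starting a new one. If the stale group was created in a
# non-default namespace, pass that same namespace= to ray.init here.
ray.init(address="auto")

# placement_group_table() lists every placement group the cluster still tracks.
for pg_id, info in placement_group_table().items():
    name = info.get("name", "")
    if name.startswith("global_pool"):  # prefix taken from the error message
        print(pg_id, name, info.get("state"))
        # Remove the leftover group so the next run can recreate it under
        # the same name.
        remove_placement_group(get_placement_group(name))
```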
@eric-haibin-lin would you mind adding a `ray` label so that I can track the progress?
I encountered the same issue and resolved it by reinstalling PyTorch with the following command:
```bash
pip3 install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
```
Maybe you can try reinstalling PyTorch to see if it resolves the problem.
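If you try this, a quick sanity check (just a generic snippet, nothing verl-specific) is to confirm the reinstalled wheel really targets CUDA 12.4 before relaunching training:

```python
import torch

# The wheel installed above is built against CUDA 12.4, so these should
# report a matching version; a mismatch suggests the reinstall did not
# take effect in the active environment.
print(torch.__version__)          # expected something like 2.4.0+cu124
print(torch.version.cuda)         # expected 12.4
print(torch.cuda.is_available())  # should be True on a GPU node
```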