
I do not know how to deal with this bug

Open dongguanting opened this issue 10 months ago • 3 comments

```
(WorkerDict pid=59715) Qwen2ForCausalLM contains 494.03M parameters
(WorkerDict pid=59715) Before building vllm rollout, memory allocated (GB): 0.9203834533691406, memory reserved (GB): 2.62890625
(WorkerDict pid=59715) INFO 03-04 17:15:40 config.py:1005] Chunked prefill is enabled with max_num_batched_tokens=8192.
(WorkerDict pid=59715) WARNING 03-04 17:15:40 config.py:380] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
(WorkerDict pid=59999) Total steps: 435, num_warmup_steps: 0 [repeated 7x across cluster]
(WorkerDict pid=59715) Critic use_remove_padding=False [repeated 3x across cluster]
(WorkerDict pid=59999) wrap_policy: functools.partial(<function or_policy at 0x1529034daca0>, policies=[functools.partial(<function transformer_auto_wrap_policy at 0x1529034dab60>, transformer_layer_cls={<class 'transformers.models.qwen2.modeling_qwen2.Qwen2DecoderLayer'>})]) [repeated 7x across cluster]
(WorkerDict pid=59999) Actor use_remove_padding=False [repeated 7x across cluster]
(WorkerDict pid=59715) local rank 0
(WorkerDict pid=59978) NCCL version 2.20.5+cuda12.4
(WorkerDict pid=59715) before init cache memory allocated: 1.997223424GB, reserved: 2.059403264GB
(WorkerDict pid=59715) after init cache memory allocated: 33.55516672GB, reserved: 33.61734656GB
(WorkerDict pid=59999) /home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:689: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict . Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
(WorkerDict pid=59999)   warnings.warn(
(WorkerDict pid=59999) Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the with torch.autocast(device_type='torch_device'): decorator, or load the model with the torch_dtype argument. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16) [repeated 3x across cluster]
(WorkerDict pid=59999) kwargs: {'n': 1, 'logprobs': 1, 'max_tokens': 256, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
(WorkerDict pid=59999) INFO 03-04 17:15:40 config.py:1005] Chunked prefill is enabled with max_num_batched_tokens=8192. [repeated 3x across cluster]
(WorkerDict pid=59999) WARNING 03-04 17:15:40 config.py:380] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used [repeated 3x across cluster]
(WorkerDict pid=59999) local rank 0 [repeated 3x across cluster]
(WorkerDict pid=59715) After building vllm rollout, memory allocated (GB): 30.322853088378906, memory reserved (GB): 31.30859375
(WorkerDict pid=59715) After building sharding manager, memory allocated (GB): 30.322853088378906, memory reserved (GB): 31.30859375
(WorkerDict pid=59999) NCCL version 2.20.5+cuda12.4 [repeated 2x across cluster]
(main_task pid=59139) Using LocalLogger is deprecated. The constructor API will change
(main_task pid=59139) Checkpoint tracker file does not exist: %s /home/u2024001021/verl-main/checkpoints/verl_examples/gsm8k/latest_checkpointed_iteration.txt
(main_task pid=59139) Training from scratch
(WorkerDict pid=59715) /tmp/tmplaiy5fz/main.c:6:23: fatal error: stdatomic.h: No such file or directory
(WorkerDict pid=59715)  #include <stdatomic.h>
(WorkerDict pid=59715)                        ^
(WorkerDict pid=59715) compilation terminated.
(WorkerDict pid=59985) [same FSDP.state_dict_type FutureWarning as above] [repeated 3x across cluster]
(WorkerDict pid=59985)   warnings.warn( [repeated 3x across cluster]
Error executing job with overrides: ['data.train_files=/home/u2024001021/datasets/gsm8k/train.parquet', 'data.val_files=/home/u2024001021/datasets/gsm8k/test.parquet', 'data.train_batch_size=256', 'data.max_prompt_length=512', 'data.max_response_length=256', 'actor_rollout_ref.model.path=/fs/archive/share/u2024001021/huggingface_models/Qwen2.5-0.5B-Instruct', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=64', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=1', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4', 'critic.optim.lr=1e-5', 'critic.model.path=/fs/archive/share/u2024001021/huggingface_models/Qwen2.5-0.5B-Instruct', 'critic.ppo_micro_batch_size_per_gpu=4', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.logger=[console]', '+trainer.val_before_train=False', 'trainer.default_hdfs_dir=null', 'trainer.n_gpus_per_node=4', 'trainer.nnodes=1', 'trainer.save_freq=10', 'trainer.test_freq=10', 'trainer.total_epochs=15']
(main_task pid=59139) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=59999, ip=10.0.0.1, actor_id=e84f6786088faaaca01435c301000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x1528c97f7e10>)
(main_task pid=59139)   File "/home/u2024001021/verl-main/verl/single_controller/ray/base.py", line 399, in func
(main_task pid=59139)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(main_task pid=59139)   File "/home/u2024001021/verl-main/verl/single_controller/base/decorator.py", line 404, in inner
(main_task pid=59139)     return func(*args, **kwargs)
(main_task pid=59139)   File "/home/u2024001021/verl-main/verl/workers/fsdp_workers.py", line 516, in compute_log_prob
(main_task pid=59139)     output = self.actor.compute_log_prob(data=data)
(main_task pid=59139)   File "/home/u2024001021/verl-main/verl/workers/actor/dp_actor.py", line 214, in compute_log_prob
(main_task pid=59139)     _, log_probs = self._forward_micro_batch(micro_batch, temperature=temperature)
(main_task pid=59139)   File "/home/u2024001021/verl-main/verl/workers/actor/dp_actor.py", line 153, in _forward_micro_batch
(main_task pid=59139)     log_probs = logprobs_from_logits(logits, micro_batch['responses'])
(main_task pid=59139)   File "/home/u2024001021/verl-main/verl/utils/torch_functional.py", line 57, in logprobs_from_logits
(main_task pid=59139)     output = logprobs_from_logits_flash_attn(logits, labels)
(main_task pid=59139)   File "/home/u2024001021/verl-main/verl/utils/torch_functional.py", line 65, in logprobs_from_logits_flash_attn
(main_task pid=59139)     output = cross_entropy_loss(logits, labels)
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/flash_attn/ops/triton/cross_entropy.py", line 319, in cross_entropy_loss
(main_task pid=59139)     return CrossEntropyLoss.apply(
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
(main_task pid=59139)     return super().apply(*args, **kwargs)  # type: ignore[misc]
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/flash_attn/ops/triton/cross_entropy.py", line 196, in forward
(main_task pid=59139)     cross_entropy_fwd_kernel[(n_rows,)](
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
(main_task pid=59139)     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 338, in run
(main_task pid=59139)     return self.fn.run(*args, **kwargs)
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/runtime/jit.py", line 607, in run
(main_task pid=59139)     device = driver.active.get_current_device()
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/runtime/driver.py", line 23, in __getattr__
(main_task pid=59139)     self._initialize_obj()
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
(main_task pid=59139)     self._obj = self._init_fn()
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/runtime/driver.py", line 9, in _create_driver
(main_task pid=59139)     return actives[0]()
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
(main_task pid=59139)     self.utils = CudaUtils()  # TODO: make static
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
(main_task pid=59139)     mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
(main_task pid=59139)     so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/runtime/build.py", line 48, in _build
(main_task pid=59139)     ret = subprocess.check_call(cc_cmd)
(main_task pid=59139)   File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/subprocess.py", line 413, in check_call
(main_task pid=59139)     raise CalledProcessError(retcode, cmd)
(main_task pid=59139) subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp52w3y_1k/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp52w3y_1k/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp52w3y_1k', '-I/home/u2024001021/anaconda3/envs/EasyRL/include/python3.11']' returned non-zero exit status 1.
(main_task pid=59139) [identical Unhandled error tracebacks follow for WorkerDict pid=59985 and pid=59715, each ending in the same CalledProcessError from /usr/bin/gcc]
Traceback (most recent call last):
  File "/home/u2024001021/verl-main/verl/trainer/main_ppo.py", line 25, in main
    run_ppo(config)
  File "/home/u2024001021/verl-main/verl/trainer/main_ppo.py", line 33, in run_ppo
    ray.get(main_task.remote(config, compute_score))
  File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/ray/_private/worker.py", line 2753, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/ray/_private/worker.py", line 904, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(CalledProcessError): ray::main_task() (pid=59139, ip=10.0.0.1)
  File "/home/u2024001021/verl-main/verl/trainer/main_ppo.py", line 128, in main_task
    trainer.fit()
  File "/home/u2024001021/verl-main/verl/trainer/ppo/ray_trainer.py", line 949, in fit
    old_log_prob = self.actor_rollout_wg.compute_log_prob(batch)
  File "/home/u2024001021/verl-main/verl/single_controller/ray/base.py", line 42, in func
    output = ray.get(output)
ray.exceptions.RayTaskError(CalledProcessError): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=59978, ip=10.0.0.1, actor_id=5199388e8f71ae4d3a3a754401000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x1495c1f54390>)
  [same worker-side traceback as above, ending in]
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpxkorz697/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpxkorz697/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/home/u2024001021/anaconda3/envs/EasyRL/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpxkorz697', '-I/home/u2024001021/anaconda3/envs/EasyRL/include/python3.11']' returned non-zero exit status 1.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(WorkerDict pid=59985) kwargs: {'n': 1, 'logprobs': 1, 'max_tokens': 256, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 3x across cluster]
(WorkerDict pid=59999) /tmp/tmp52w3y_1k/main.c:6:23: fatal error: stdatomic.h: No such file or directory [repeated 3x across cluster]
(WorkerDict pid=59999)  #include <stdatomic.h> [repeated 3x across cluster]
(WorkerDict pid=59999)                        ^ [repeated 3x across cluster]
(WorkerDict pid=59999) compilation terminated. [repeated 3x across cluster]
```
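The actual failure appears to be the gcc invocation at the bottom: Triton JIT-compiles a small `cuda_utils` C extension at runtime, and `/usr/bin/gcc` cannot find `stdatomic.h`, a C11 header that ships with GCC 4.9 and later (old system compilers such as GCC 4.8.5 on CentOS 7 lack it). A quick way to confirm whether the host compiler is the problem (a diagnostic sketch, not part of the original report):

```bash
# Check which gcc Triton picks up; stdatomic.h requires GCC >= 4.9,
# so an old system compiler will fail exactly like the log above.
gcc --version

# Preprocess a one-line file that includes stdatomic.h; this reproduces
# the failure Triton hits when building cuda_utils.
echo '#include <stdatomic.h>' | gcc -E -x c - > /dev/null \
    && echo "stdatomic.h found" \
    || echo "stdatomic.h missing -- compiler too old"
```

If the second command fails with the same `stdatomic.h` error, the compiler is the culprit, not verl or vLLM.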

dongguanting avatar Mar 04 '25 09:03 dongguanting

Looks like this is the main error:

```
(WorkerDict pid=1027) /tmp/tmpe9v2ibae/main.c:6:23: fatal error: stdatomic.h: No such file or directory
(WorkerDict pid=1027)  #include <stdatomic.h>
```
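If the system gcc is too old to ship `stdatomic.h`, one workaround is to point Triton at a newer compiler: Triton's `runtime/build.py` respects the `CC` environment variable when compiling this stub. A sketch of that fix, assuming a conda-forge toolchain (the package name, binary path, and `module load` line are environment-dependent assumptions):

```bash
# Option 1: install a modern compiler into the conda env and tell
# Triton to use it (the exact binary name may differ by platform).
conda install -c conda-forge gcc_linux-64
export CC="$CONDA_PREFIX/bin/x86_64-conda-linux-gnu-cc"

# Option 2: on an HPC cluster, load a newer toolchain instead
# (site-specific; version is illustrative).
# module load gcc/11

# Clear Triton's JIT cache so the stub is rebuilt, then rerun training.
rm -rf ~/.triton/cache
```

Any route works as long as whatever compiler `CC` points at actually provides `stdatomic.h`.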

dongguanting avatar Mar 04 '25 09:03 dongguanting

I'm hitting the same issue. How do you solve it?

xieguobin avatar Apr 18 '25 07:04 xieguobin

Same error here; I also don't know how to solve it.

liangDYL avatar May 27 '25 14:05 liangDYL

me too!

jxmorris12 avatar Jul 17 '25 19:07 jxmorris12

@dongguanting @jxmorris12 @liangDYL @xieguobin I'm running into this problem too. Have any of you solved it?

xjtupy avatar Sep 15 '25 12:09 xjtupy

> @dongguanting @jxmorris12 @liangDYL @xieguobin I'm running into this problem too. Have any of you solved it?

I'm running into it too. Has it been solved?

Yu7-code avatar Oct 22 '25 16:10 Yu7-code