
[Qwen2.5 + GRPO] Cannot Run on V100: float16 crashes, float32 fails during validation

Status: Open · lot-insts opened this issue 8 months ago · 9 comments

Hi, I'm trying to run GRPO training with VERL on NVIDIA V100 GPUs using the Qwen2.5-0.5B model.

Problem Summary:

  • When using float16 (actor_rollout_ref.rollout.dtype=float16), training crashes almost immediately with FlashAttention/Triton errors.
  • When switching to float32, training starts and runs for a while, but later fails during validation with RuntimeError: FlashAttention only supports Ampere GPUs or newer.

Environment:

  • GPU: V100 (Tesla, Compute Capability 7.0; see the quick check below)
  • The conda environment was set up following https://github.com/volcengine/verl/blob/main/docs/README_vllm0.8.md
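
Since flash-attn 2 only supports compute capability >= 8.0 (Ampere), the V100 at 7.0 is below the cutoff. This can be confirmed with a quick check (a minimal sketch using only PyTorch):

import torch

# V100 reports (7, 0); flash-attn 2 requires sm 8.0 (Ampere) or newer,
# matching the "FlashAttention only supports Ampere GPUs" error below.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
print("flash-attn 2 usable:", (major, minor) >= (8, 0))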

The config is as follows:

set -x

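# V100 (sm 7.0) cannot use vLLM's FlashAttention backend, so pin vLLM to xformers.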
export VLLM_ATTENTION_BACKEND=XFORMERS
export HYDRA_FULL_ERROR=1 


python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/datasets/gsm8k/train.parquet \
    data.val_files=/datasets/gsm8k/test.parquet \
    data.train_batch_size=64 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    algorithm.use_kl_in_reward=False \
    actor_rollout_ref.model.path=/huggingface/Qwen2.5-0.5B \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=16 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.dtype=float16 \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen25_05b_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 "$@"
    
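For context, the rollout dtype is set to float16 because the V100 has no bfloat16 support, so bf16 (which I believe is verl's default) is not an option here. A quick check, assuming PyTorch reports this correctly:

import torch

# bfloat16 needs Ampere (sm 8.0) or newer; a V100 returns False here,
# which is why the script above sets rollout.dtype=float16 rather than bf16.
print(torch.cuda.is_bf16_supported())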

Running it produces the following errors:

(verl) [root@bras-base-otwaon1140w-grc-29-76-64-194-36 verl]# bash start.sh 2>&1 | tee ./start_float16.log
+ export VLLM_ATTENTION_BACKEND=XFORMERS
+ VLLM_ATTENTION_BACKEND=XFORMERS
+ export HYDRA_FULL_ERROR=1
+ HYDRA_FULL_ERROR=1
+ python3 -m verl.trainer.main_ppo algorithm.adv_estimator=grpo data.train_files=/datasets/gsm8k/train.parquet data.val_files=/datasets/gsm8k/test.parquet data.train_batch_size=64 data.max_prompt_length=512 data.max_response_length=1024 data.filter_overlong_prompts=True data.truncation=error algorithm.use_kl_in_reward=False actor_rollout_ref.model.path=/huggingface/Qwen2.5-0.5B actor_rollout_ref.actor.optim.lr=1e-6 actor_rollout_ref.model.use_remove_padding=True actor_rollout_ref.actor.ppo_mini_batch_size=16 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 actor_rollout_ref.actor.use_kl_loss=True actor_rollout_ref.actor.kl_loss_coef=0.001 actor_rollout_ref.actor.kl_loss_type=low_var_kl actor_rollout_ref.actor.entropy_coeff=0 actor_rollout_ref.model.enable_gradient_checkpointing=True actor_rollout_ref.rollout.enable_chunked_prefill=False actor_rollout_ref.rollout.dtype=float16 actor_rollout_ref.actor.fsdp_config.param_offload=False actor_rollout_ref.actor.fsdp_config.optimizer_offload=False actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 actor_rollout_ref.rollout.tensor_model_parallel_size=2 actor_rollout_ref.rollout.name=vllm actor_rollout_ref.rollout.gpu_memory_utilization=0.6 actor_rollout_ref.rollout.n=5 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 actor_rollout_ref.ref.fsdp_config.param_offload=True trainer.critic_warmup=0 'trainer.logger=[console,wandb]' trainer.project_name=verl_grpo_example_gsm8k trainer.experiment_name=qwen25_05b_function_rm trainer.n_gpus_per_node=8 trainer.nnodes=1 trainer.save_freq=-1 trainer.test_freq=5 trainer.total_epochs=15
2025-04-09 14:27:43,972 INFO worker.py:1843 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(TaskRunner pid=1472490) {'actor_rollout_ref': {'actor': {'checkpoint': {'contents': ['model',
(TaskRunner pid=1472490)                                                              'optimizer',
(TaskRunner pid=1472490)                                                              'extra']},
(TaskRunner pid=1472490)                                  'clip_ratio': 0.2,
(TaskRunner pid=1472490)                                  'clip_ratio_c': 3.0,
(TaskRunner pid=1472490)                                  'clip_ratio_high': 0.2,
(TaskRunner pid=1472490)                                  'clip_ratio_low': 0.2,
(TaskRunner pid=1472490)                                  'entropy_coeff': 0,
(TaskRunner pid=1472490)                                  'fsdp_config': {'fsdp_size': -1,
(TaskRunner pid=1472490)                                                  'optimizer_offload': False,
(TaskRunner pid=1472490)                                                  'param_offload': False,
(TaskRunner pid=1472490)                                                  'wrap_policy': {'min_num_params': 0}},
(TaskRunner pid=1472490)                                  'grad_clip': 1.0,
(TaskRunner pid=1472490)                                  'kl_loss_coef': 0.001,
(TaskRunner pid=1472490)                                  'kl_loss_type': 'low_var_kl',
(TaskRunner pid=1472490)                                  'loss_agg_mode': 'token-mean',
(TaskRunner pid=1472490)                                  'optim': {'lr': 1e-06,
(TaskRunner pid=1472490)                                            'lr_warmup_steps': -1,
(TaskRunner pid=1472490)                                            'lr_warmup_steps_ratio': 0.0,
(TaskRunner pid=1472490)                                            'min_lr_ratio': None,
(TaskRunner pid=1472490)                                            'total_training_steps': -1,
(TaskRunner pid=1472490)                                            'warmup_style': 'constant',
(TaskRunner pid=1472490)                                            'weight_decay': 0.01},
(TaskRunner pid=1472490)                                  'ppo_epochs': 1,
(TaskRunner pid=1472490)                                  'ppo_max_token_len_per_gpu': 16384,
(TaskRunner pid=1472490)                                  'ppo_micro_batch_size': None,
(TaskRunner pid=1472490)                                  'ppo_micro_batch_size_per_gpu': 2,
(TaskRunner pid=1472490)                                  'ppo_mini_batch_size': 16,
(TaskRunner pid=1472490)                                  'shuffle': False,
(TaskRunner pid=1472490)                                  'strategy': 'fsdp',
(TaskRunner pid=1472490)                                  'ulysses_sequence_parallel_size': 1,
(TaskRunner pid=1472490)                                  'use_dynamic_bsz': False,
(TaskRunner pid=1472490)                                  'use_kl_loss': True,
(TaskRunner pid=1472490)                                  'use_torch_compile': True},
(TaskRunner pid=1472490)                        'hybrid_engine': True,
(TaskRunner pid=1472490)                        'model': {'enable_gradient_checkpointing': True,
(TaskRunner pid=1472490)                                  'external_lib': None,
(TaskRunner pid=1472490)                                  'override_config': {},
(TaskRunner pid=1472490)                                  'path': '/huggingface/Qwen2.5-0.5B',
(TaskRunner pid=1472490)                                  'use_remove_padding': True},
(TaskRunner pid=1472490)                        'ref': {'fsdp_config': {'param_offload': True,
(TaskRunner pid=1472490)                                                'wrap_policy': {'min_num_params': 0}},
(TaskRunner pid=1472490)                                'log_prob_max_token_len_per_gpu': 16384,
(TaskRunner pid=1472490)                                'log_prob_micro_batch_size': None,
(TaskRunner pid=1472490)                                'log_prob_micro_batch_size_per_gpu': 4,
(TaskRunner pid=1472490)                                'log_prob_use_dynamic_bsz': False,
(TaskRunner pid=1472490)                                'ulysses_sequence_parallel_size': 1},
(TaskRunner pid=1472490)                        'rollout': {'disable_log_stats': True,
(TaskRunner pid=1472490)                                    'do_sample': True,
(TaskRunner pid=1472490)                                    'dtype': 'float16',
(TaskRunner pid=1472490)                                    'enable_chunked_prefill': False,
(TaskRunner pid=1472490)                                    'enforce_eager': True,
(TaskRunner pid=1472490)                                    'free_cache_engine': True,
(TaskRunner pid=1472490)                                    'gpu_memory_utilization': 0.6,
(TaskRunner pid=1472490)                                    'ignore_eos': False,
(TaskRunner pid=1472490)                                    'load_format': 'dummy_dtensor',
(TaskRunner pid=1472490)                                    'log_prob_max_token_len_per_gpu': 16384,
(TaskRunner pid=1472490)                                    'log_prob_micro_batch_size': None,
(TaskRunner pid=1472490)                                    'log_prob_micro_batch_size_per_gpu': 4,
(TaskRunner pid=1472490)                                    'log_prob_use_dynamic_bsz': False,
(TaskRunner pid=1472490)                                    'max_model_len': None,
(TaskRunner pid=1472490)                                    'max_num_batched_tokens': 8192,
(TaskRunner pid=1472490)                                    'max_num_seqs': 1024,
(TaskRunner pid=1472490)                                    'n': 5,
(TaskRunner pid=1472490)                                    'name': 'vllm',
(TaskRunner pid=1472490)                                    'prompt_length': 512,
(TaskRunner pid=1472490)                                    'response_length': 1024,
(TaskRunner pid=1472490)                                    'temperature': 1.0,
(TaskRunner pid=1472490)                                    'tensor_model_parallel_size': 2,
(TaskRunner pid=1472490)                                    'top_k': -1,
(TaskRunner pid=1472490)                                    'top_p': 1,
(TaskRunner pid=1472490)                                    'use_fire_sampling': False,
(TaskRunner pid=1472490)                                    'val_kwargs': {'do_sample': False,
(TaskRunner pid=1472490)                                                   'n': 1,
(TaskRunner pid=1472490)                                                   'temperature': 0,
(TaskRunner pid=1472490)                                                   'top_k': -1,
(TaskRunner pid=1472490)                                                   'top_p': 1.0}}},
(TaskRunner pid=1472490)  'algorithm': {'adv_estimator': 'grpo',
(TaskRunner pid=1472490)                'gamma': 1.0,
(TaskRunner pid=1472490)                'kl_ctrl': {'horizon': 10000,
(TaskRunner pid=1472490)                            'kl_coef': 0.001,
(TaskRunner pid=1472490)                            'target_kl': 0.1,
(TaskRunner pid=1472490)                            'type': 'fixed'},
(TaskRunner pid=1472490)                'kl_penalty': 'kl',
(TaskRunner pid=1472490)                'lam': 1.0,
(TaskRunner pid=1472490)                'use_kl_in_reward': False},
(TaskRunner pid=1472490) DeprecationWarning: `ray.state.available_resources_per_node` is a private attribute and access will be removed in a future Ray version.
(WorkerDict pid=1474040) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
(WorkerDict pid=1474040) [rank2]:[W409 14:28:24.213069043 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 2]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
(TaskRunner pid=1472490)  'critic': {'checkpoint': {'contents': ['model', 'optimizer', 'extra']},
(TaskRunner pid=1472490)             'cliprange_value': 0.5,
(TaskRunner pid=1472490)             'forward_max_token_len_per_gpu': 32768,
(TaskRunner pid=1472490)             'forward_micro_batch_size': None,
(TaskRunner pid=1472490)             'forward_micro_batch_size_per_gpu': None,
(TaskRunner pid=1472490)             'grad_clip': 1.0,
(TaskRunner pid=1472490)             'model': {'enable_gradient_checkpointing': True,
(TaskRunner pid=1472490)                       'external_lib': None,
(TaskRunner pid=1472490)                       'fsdp_config': {'fsdp_size': -1,
(TaskRunner pid=1472490)                                       'optimizer_offload': False,
(TaskRunner pid=1472490)                                       'param_offload': False,
(TaskRunner pid=1472490)                                       'wrap_policy': {'min_num_params': 0}},
(TaskRunner pid=1472490)                       'override_config': {},
(TaskRunner pid=1472490)                       'path': '~/models/deepseek-llm-7b-chat',
(TaskRunner pid=1472490)                       'tokenizer_path': '/huggingface/Qwen2.5-0.5B',
(TaskRunner pid=1472490)                       'use_remove_padding': False},
(TaskRunner pid=1472490)             'optim': {'lr': 1e-05,
(TaskRunner pid=1472490)                       'lr_warmup_steps_ratio': 0.0,
(TaskRunner pid=1472490)                       'min_lr_ratio': None,
(TaskRunner pid=1472490)                       'total_training_steps': -1,
(TaskRunner pid=1472490)                       'warmup_style': 'constant',
(TaskRunner pid=1472490)                       'weight_decay': 0.01},
(TaskRunner pid=1472490)             'ppo_epochs': 1,
(TaskRunner pid=1472490)             'ppo_max_token_len_per_gpu': 32768,
(TaskRunner pid=1472490)             'ppo_micro_batch_size': None,
(TaskRunner pid=1472490)             'ppo_micro_batch_size_per_gpu': None,
(TaskRunner pid=1472490)             'ppo_mini_batch_size': 16,
(TaskRunner pid=1472490)             'rollout_n': 5,
(TaskRunner pid=1472490)             'shuffle': False,
(TaskRunner pid=1472490)             'strategy': 'fsdp',
(TaskRunner pid=1472490)             'ulysses_sequence_parallel_size': 1,
(TaskRunner pid=1472490)             'use_dynamic_bsz': False},
(TaskRunner pid=1472490)  'custom_reward_function': {'name': 'compute_score', 'path': None},
(TaskRunner pid=1472490)  'data': {'filter_overlong_prompts': True,
(TaskRunner pid=1472490)           'filter_overlong_prompts_workers': 1,
(TaskRunner pid=1472490)           'image_key': 'images',
(TaskRunner pid=1472490)           'max_prompt_length': 512,
(TaskRunner pid=1472490)           'max_response_length': 1024,
(TaskRunner pid=1472490)           'prompt_key': 'prompt',
(TaskRunner pid=1472490)           'return_raw_chat': False,
(TaskRunner pid=1472490)           'return_raw_input_ids': False,
(TaskRunner pid=1472490)           'reward_fn_key': 'data_source',
(TaskRunner pid=1472490)           'shuffle': True,
(TaskRunner pid=1472490)           'tokenizer': None,
(TaskRunner pid=1472490)           'train_batch_size': 64,
(TaskRunner pid=1472490)           'train_files': '/datasets/gsm8k/train.parquet',
(TaskRunner pid=1472490)           'truncation': 'error',
(TaskRunner pid=1472490)           'val_batch_size': None,
(TaskRunner pid=1472490)           'val_files': '/datasets/gsm8k/test.parquet'},
(TaskRunner pid=1472490)  'reward_model': {'enable': False,
(TaskRunner pid=1472490)                   'forward_max_token_len_per_gpu': 32768,
(TaskRunner pid=1472490)                   'max_length': None,
(TaskRunner pid=1472490)                   'micro_batch_size': None,
(TaskRunner pid=1472490)                   'micro_batch_size_per_gpu': None,
(TaskRunner pid=1472490)                   'model': {'external_lib': None,
(TaskRunner pid=1472490)                             'fsdp_config': {'fsdp_size': -1,
(TaskRunner pid=1472490)                                             'param_offload': False,
(TaskRunner pid=1472490)                                             'wrap_policy': {'min_num_params': 0}},
(TaskRunner pid=1472490)                             'input_tokenizer': '/huggingface/Qwen2.5-0.5B',
(TaskRunner pid=1472490)                             'path': '~/models/FsfairX-LLaMA3-RM-v0.1',
(TaskRunner pid=1472490)                             'use_remove_padding': False},
(TaskRunner pid=1472490)                   'reward_manager': 'naive',
(TaskRunner pid=1472490)                   'strategy': 'fsdp',
(TaskRunner pid=1472490)                   'ulysses_sequence_parallel_size': 1,
(TaskRunner pid=1472490)                   'use_dynamic_bsz': False},
(TaskRunner pid=1472490)  'trainer': {'balance_batch': True,
(TaskRunner pid=1472490)              'critic_warmup': 0,
(TaskRunner pid=1472490)              'default_hdfs_dir': None,
(TaskRunner pid=1472490)              'default_local_dir': 'checkpoints/verl_grpo_example_gsm8k/qwen25_05b_function_rm',
(TaskRunner pid=1472490)              'del_local_ckpt_after_load': False,
(TaskRunner pid=1472490)              'experiment_name': 'qwen25_05b_function_rm',
(TaskRunner pid=1472490)              'log_val_generations': 0,
(TaskRunner pid=1472490)              'logger': ['console', 'wandb'],
(TaskRunner pid=1472490)              'max_actor_ckpt_to_keep': None,
(TaskRunner pid=1472490)              'max_critic_ckpt_to_keep': None,
(TaskRunner pid=1472490)              'n_gpus_per_node': 8,
(TaskRunner pid=1472490)              'nnodes': 1,
(TaskRunner pid=1472490)              'project_name': 'verl_grpo_example_gsm8k',
(TaskRunner pid=1472490)              'resume_from_path': None,
(TaskRunner pid=1472490)              'resume_mode': 'auto',
(TaskRunner pid=1472490)              'save_freq': -1,
(TaskRunner pid=1472490)              'test_freq': 5,
(TaskRunner pid=1472490)              'total_epochs': 15,
(TaskRunner pid=1472490)              'total_training_steps': None,
(TaskRunner pid=1472490)              'val_before_train': True}}
(TaskRunner pid=1472490) [validate_config] All configuration checks passed successfully!
(TaskRunner pid=1472490) dataset len: 7473
(TaskRunner pid=1472490) filter dataset len: 7473
(TaskRunner pid=1472490) dataset len: 1319
(TaskRunner pid=1472490) filter dataset len: 1319
(TaskRunner pid=1472490) Size of train dataloader: 116
(TaskRunner pid=1472490) Total training steps: 1740
(WorkerDict pid=1474040) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention
(WorkerDict pid=1473743) Model config after override: Qwen2Config {
(WorkerDict pid=1473743)   "architectures": [
(WorkerDict pid=1473743)     "Qwen2ForCausalLM"
(WorkerDict pid=1473743)   ],
(WorkerDict pid=1473743)   "attention_dropout": 0.0,
(WorkerDict pid=1473743)   "eos_token_id": 151643,
(WorkerDict pid=1473743)   "hidden_act": "silu",
(WorkerDict pid=1473743)   "hidden_size": 896,
(WorkerDict pid=1473743)   "initializer_range": 0.02,
(WorkerDict pid=1473743)   "intermediate_size": 4864,
(WorkerDict pid=1473743)   "max_position_embeddings": 32768,
(WorkerDict pid=1474049) Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
(WorkerDict pid=1474051) You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`. [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(WorkerDict pid=1474051) [rank7]:[W409 14:28:26.972973382 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 7]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id. [repeated 7x across cluster]
(WorkerDict pid=1474040) /home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
(WorkerDict pid=1474040)   warnings.warn(
(WorkerDict pid=1474050) Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` [repeated 7x across cluster]
(TaskRunner pid=1472490) wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
(TaskRunner pid=1472490) wandb: Network error (SSLError), entering retry loop.
(WorkerDict pid=1474051) /home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 7x across cluster]
(WorkerDict pid=1474051)   warnings.warn( [repeated 7x across cluster]
(WorkerDict pid=1473743) LLVM ERROR: Failed to compute parent layout for slice layout.
(WorkerDict pid=1473743) *** SIGABRT received at time=1744180215 on cpu 46 ***
(WorkerDict pid=1473743) PC: @     0x7fd494c5081b  (unknown)  raise
(WorkerDict pid=1473743)     @     0x7fd494f6f5a0   54596288  (unknown)
(WorkerDict pid=1473743)     @                0x2  (unknown)  (unknown)
(WorkerDict pid=1473743)     @     0x7fa4492eba30  (unknown)  (unknown)
(WorkerDict pid=1473743) [2025-04-09 14:30:15,691 E 1473743 1473743] logging.cc:497: *** SIGABRT received at time=1744180215 on cpu 46 ***
(WorkerDict pid=1473743) [2025-04-09 14:30:15,691 E 1473743 1473743] logging.cc:497: PC: @     0x7fd494c5081b  (unknown)  raise
(WorkerDict pid=1474041) [2025-04-09 14:30:15,694 E 1474041 1474041] logging.cc:497:     @     0x7fb555e085a0  (unknown)  (unknown)
(WorkerDict pid=1473743) [2025-04-09 14:30:15,693 E 1473743 1473743] logging.cc:497:     @     0x7fd494f6f5a0   54596288  (unknown)
(WorkerDict pid=1473743) Fatal Python error: Aborted
(WorkerDict pid=1473743) 
(WorkerDict pid=1473743) Stack (most recent call first):
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 286 in make_llir
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 387 in <lambda>
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/triton/compiler/compiler.py", line 279 in compile
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 623 in run
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/triton/runtime/jit.py", line 330 in <lambda>
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/attention/ops/prefix_prefill.py", line 842 in context_attention_fwd
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/attention/ops/paged_attn.py", line 212 in forward_prefix
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/attention/backends/xformers.py", line 573 in forward
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/attention/layer.py", line 342 in unified_attention
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_ops.py", line 1123 in __call__
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/attention/layer.py", line 229 in forward
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 177 in forward
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 243 in forward
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 338 in forward
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 172 in __call__
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 462 in forward
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750 in _call_impl
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739 in _wrapped_call_impl
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1742 in execute_model
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 420 in execute_model
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/utils.py", line 2255 in run_method
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56 in collective_rpc
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 139 in execute_model
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1434 in step
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1375 in _run_engine
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 465 in generate
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/vllm/utils.py", line 1072 in inner
(WorkerDict pid=1473743)   File "/verl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 235 in generate_sequences
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116 in decorate_context
(WorkerDict pid=1473743)   File "/verl/verl/workers/fsdp_workers.py", line 513 in generate_sequences
(WorkerDict pid=1473743)   File "/verl/verl/single_controller/base/decorator.py", line 404 in inner
(WorkerDict pid=1473743)   File "/verl/verl/single_controller/ray/base.py", line 419 in func
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/ray/util/tracing/tracing_helper.py", line 463 in _resume_span
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/ray/_private/function_manager.py", line 689 in actor_method_executor
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 945 in main_loop
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/ray/_private/workers/default_worker.py", line 320 in <module>
(WorkerDict pid=1473743) 
(WorkerDict pid=1473743) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, PIL._imagingft, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, 
scipy.optimize._direct, msgspec._core, zmq.backend.cython._zmq, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, sentencepiece._sentencepiece, pyarrow._json, regex._regex, vllm.cumem_allocator, cuda_utils, __triton_launcher (total: 152)
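
The SIGABRT above appears to come from Triton while compiling vLLM's prefix_prefill kernel (context_attention_fwd) for the sm 7.0 target. A standalone repro sketch, outside verl, to check whether plain vLLM generation already triggers it (model path, dtype, and backend taken from the config above; just a sketch, not verified):

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"  # same backend as the failing run

from vllm import LLM, SamplingParams

# If this also aborts inside triton's make_llir, the crash is in vLLM's
# kernels on V100 rather than in verl's training loop.
llm = LLM(model="/huggingface/Qwen2.5-0.5B", dtype="float16", enforce_eager=True)
outputs = llm.generate(["Hello"], SamplingParams(temperature=1.0, max_tokens=16))
print(outputs[0].outputs[0].text)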

lot-insts · Apr 09 '25 06:04

When I set actor_rollout_ref.rollout.dtype=float32 instead, training seems to run normally, but once it reaches validation it fails with the error below:

(WorkerDict pid=1487772) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention [repeated 7x across cluster]
(WorkerDict pid=1486765) Qwen2ForCausalLM contains 494.03M parameters
(WorkerDict pid=1486765) Total steps: 1740, num_warmup_steps: 0
(WorkerDict pid=1486765) Before building vllm rollout, memory allocated (GB): 0.2300548553466797, memory reserved (GB): 2.41796875
(WorkerDict pid=1487772) wrap_policy: functools.partial(<function _or_policy at 0x7f9604a2ec20>, policies=[functools.partial(<function transformer_auto_wrap_policy at 0x7f9604a2eb00>, transformer_layer_cls={<class 'transformers.models.qwen2.modeling_qwen2.Qwen2DecoderLayer'>})]) [repeated 15x across cluster]
(WorkerDict pid=1486765) Actor use_remove_padding=True [repeated 8x across cluster]
(WorkerDict pid=1487770) WARNING 04-09 14:36:42 [arg_utils.py:1854] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
(WorkerDict pid=1487770) WARNING 04-09 14:36:42 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
(WorkerDict pid=1487773) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention [repeated 8x across cluster]
(WorkerDict pid=1487772) Total steps: 1740, num_warmup_steps: 0 [repeated 7x across cluster]
(WorkerDict pid=1487772) Actor use_remove_padding=True [repeated 7x across cluster]
(WorkerDict pid=1487775) NCCL version 2.21.5+cuda12.4
(WorkerDict pid=1487771) WARNING 04-09 14:36:42 [arg_utils.py:1854] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0. [repeated 7x across cluster]
(WorkerDict pid=1487771) WARNING 04-09 14:36:42 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used [repeated 7x across cluster]
(WorkerDict pid=1486765) kwargs: {'n': 5, 'logprobs': 0, 'max_tokens': 1024, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False}
(WorkerDict pid=1486765) After building vllm rollout, memory allocated (GB): 17.875736236572266, memory reserved (GB): 18.353515625
(WorkerDict pid=1486765) After building sharding manager, memory allocated (GB): 17.875736236572266, memory reserved (GB): 18.353515625
(WorkerDict pid=1487773) NCCL version 2.21.5+cuda12.4 [repeated 2x across cluster]
(TaskRunner pid=1486188) Using LocalLogger is deprecated. The constructor API will change
(TaskRunner pid=1486188) Checkpoint tracker file does not exist: %s /checkpoints/verl_grpo_example_gsm8k/qwen25_05b_function_rm/latest_checkpointed_iteration.txt
(TaskRunner pid=1486188) Training from scratch
(WorkerDict pid=1487773) kwargs: {'n': 5, 'logprobs': 0, 'max_tokens': 1024, 'detokenize': False, 'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'ignore_eos': False} [repeated 7x across cluster]
(TaskRunner pid=1486188) test_gen_batch meta info: {'eos_token_id': 151643, 'pad_token_id': 151643, 'recompute_log_prob': False, 'do_sample': False, 'validate': True}
(TaskRunner pid=1486188) validation generation end
(TaskRunner pid=1486188) [prompt] system
(TaskRunner pid=1486188) You are a helpful assistant.
(TaskRunner pid=1486188) user
(TaskRunner pid=1486188) Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer after "####".
(TaskRunner pid=1486188) assistant
(TaskRunner pid=1486188) 
(TaskRunner pid=1486188) [response] Janet makes $2 per fresh egg, so she makes $2 x 16 = $32 every day. She eats 3 eggs every morning, so she makes $3 x 16 = $48 every day. She also bakes muffins for her friends, so she makes $48 x 7 = $336 every day. She sells the remaining eggs at the farmers' market, so she makes $336 - $32 = $304 every day.####
(TaskRunner pid=1486188)  cougar
……
(TaskRunner pid=1486188)  cougar
(TaskRunner pid=1486188) 
Training Progress:   0%|          | 0/1740 [00:00<?, ?it/s]
(TaskRunner pid=1486188) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=1487771, ip=76.64.194.36, actor_id=1f525942cd00e25086c2733101000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fa323d98490>)
(TaskRunner pid=1486188)   File "/verl/verl/verl/single_controller/ray/base.py", line 419, in func
(TaskRunner pid=1486188)     return getattr(self.worker_dict[key], name)(*args, **kwargs)
(TaskRunner pid=1486188)   File "/verl/verl/verl/single_controller/base/decorator.py", line 404, in inner
(TaskRunner pid=1486188)     return func(*args, **kwargs)
(TaskRunner pid=1486188)   File "/verl/verl/verl/workers/fsdp_workers.py", line 540, in compute_log_prob
(TaskRunner pid=1486188)     output = self.actor.compute_log_prob(data=data)
(TaskRunner pid=1486188)   File "/verl/verl/verl/workers/actor/dp_actor.py", line 223, in compute_log_prob
(TaskRunner pid=1486188)     _, log_probs = self._forward_micro_batch(micro_batch, temperature=temperature)
(TaskRunner pid=1486188)   File "/verl/verl/verl/workers/actor/dp_actor.py", line 110, in _forward_micro_batch
(TaskRunner pid=1486188)     output = self.actor_module(input_ids=input_ids_rmpad,
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(TaskRunner pid=1486188)     return self._call_impl(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(TaskRunner pid=1486188)     return forward_call(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 864, in forward
(TaskRunner pid=1486188)     output = self._fsdp_wrapped_module(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(TaskRunner pid=1486188)     return self._call_impl(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(TaskRunner pid=1486188)     return forward_call(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/utils/generic.py", line 965, in wrapper
(TaskRunner pid=1486188)     output = func(self, *args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
(TaskRunner pid=1486188)     return func(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 823, in forward
(TaskRunner pid=1486188)     outputs: BaseModelOutputWithPast = self.model(
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(TaskRunner pid=1486188)     return self._call_impl(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(TaskRunner pid=1486188)     return forward_call(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/utils/generic.py", line 965, in wrapper
(TaskRunner pid=1486188)     output = func(self, *args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 549, in forward
(TaskRunner pid=1486188)     layer_outputs = decoder_layer(
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(TaskRunner pid=1486188)     return self._call_impl(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(TaskRunner pid=1486188)     return forward_call(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 864, in forward
(TaskRunner pid=1486188)     output = self._fsdp_wrapped_module(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(TaskRunner pid=1486188)     return self._call_impl(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(TaskRunner pid=1486188)     return forward_call(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 262, in forward
(TaskRunner pid=1486188)     hidden_states, self_attn_weights = self.self_attn(
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(TaskRunner pid=1486188)     return self._call_impl(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(TaskRunner pid=1486188)     return forward_call(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 194, in forward
(TaskRunner pid=1486188)     attn_output, attn_weights = attention_interface(
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/integrations/flash_attention.py", line 49, in flash_attention_forward
(TaskRunner pid=1486188)     attn_output = _flash_attention_forward(
(TaskRunner pid=1486188)   File "/verl/verl/verl/models/transformers/monkey_patch.py", line 95, in _ulysses_flash_attention_forward
(TaskRunner pid=1486188)     attn_output = _flash_attention_forward(query_states,
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/transformers/modeling_flash_attention_utils.py", line 395, in _flash_attention_forward
(TaskRunner pid=1486188)     attn_output = flash_attn_varlen_func(
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 1448, in flash_attn_varlen_func
(TaskRunner pid=1486188)     return FlashAttnVarlenFunc.apply(
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
(TaskRunner pid=1486188)     return super().apply(*args, **kwargs)  # type: ignore[misc]
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 930, in forward
(TaskRunner pid=1486188)     out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
(TaskRunner pid=1486188)     return self._op(*args, **(kwargs or {}))
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_library/autograd.py", line 113, in autograd_impl
(TaskRunner pid=1486188)     result = forward_no_grad(*args, Metadata(keyset, keyword_only_args))
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_library/autograd.py", line 40, in forward_no_grad
(TaskRunner pid=1486188)     result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_ops.py", line 728, in redispatch
(TaskRunner pid=1486188)     return self._handle.redispatch_boxed(keyset, *args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 305, in backend_impl
(TaskRunner pid=1486188)     result = self._backend_fns[device_type](*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
(TaskRunner pid=1486188)     return disable_fn(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(TaskRunner pid=1486188)     return fn(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 337, in wrapped_fn
(TaskRunner pid=1486188)     return fn(*args, **kwargs)
(TaskRunner pid=1486188)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 170, in _flash_attn_varlen_forward
(TaskRunner pid=1486188)     out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
(TaskRunner pid=1486188) RuntimeError: FlashAttention only supports Ampere GPUs or newer.
(TaskRunner pid=1486188) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=1487774, ip=76.64.194.36, actor_id=5888edef522f4580d115656101000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f0ec416c4f0>)
(TaskRunner pid=1486188)   [identical traceback repeated for this worker, truncated]
[36m(TaskRunner pid=1486188)[0m     return FlashAttnVarlenFunc.apply(
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[36m(TaskRunner pid=1486188)[0m     return super().apply(*args, **kwargs)  # type: ignore[misc]
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 930, in forward
[36m(TaskRunner pid=1486188)[0m     out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_ops.py", line 1123, in __call__
[36m(TaskRunner pid=1486188)[0m     return self._op(*args, **(kwargs or {}))
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_library/autograd.py", line 113, in autograd_impl
[36m(TaskRunner pid=1486188)[0m     result = forward_no_grad(*args, Metadata(keyset, keyword_only_args))
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_library/autograd.py", line 40, in forward_no_grad
[36m(TaskRunner pid=1486188)[0m     result = op.redispatch(keyset & _C._after_autograd_keyset, *args, **kwargs)
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_ops.py", line 728, in redispatch
[36m(TaskRunner pid=1486188)[0m     return self._handle.redispatch_boxed(keyset, *args, **kwargs)
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 305, in backend_impl
[36m(TaskRunner pid=1486188)[0m     result = self._backend_fns[device_type](*args, **kwargs)
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
[36m(TaskRunner pid=1486188)[0m     return disable_fn(*args, **kwargs)
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
[36m(TaskRunner pid=1486188)[0m     return fn(*args, **kwargs)
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/_library/custom_ops.py", line 337, in wrapped_fn
[36m(TaskRunner pid=1486188)[0m     return fn(*args, **kwargs)
[36m(TaskRunner pid=1486188)[0m   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 170, in _flash_attn_varlen_forward
[36m(TaskRunner pid=1486188)[0m     out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
[36m(TaskRunner pid=1486188)[0m RuntimeError: FlashAttention only supports Ampere GPUs or newer.
(TaskRunner pid=1486188) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::WorkerDict.actor_rollout_compute_log_prob() (pid=1487773, ip=76.64.194.36, actor_id=b40498053ecac13bf0ea2c7701000000) fails with the identical traceback, ending in the same RuntimeError: FlashAttention only supports Ampere GPUs or newer.

Does anyone know how to solve this?

lot-insts avatar Apr 09 '25 06:04 lot-insts

FlashAttention does not support fp32 computation. See here
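
On V100 there are actually two separate constraints, which is why both dtypes fail in different ways; a minimal check (my own sketch, assuming PyTorch with CUDA is available):

import torch

# FlashAttention kernels only accept fp16/bf16 inputs, and the flash-attn 2.x
# builds additionally require compute capability >= 8.0 (Ampere or newer).
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: sm_{major}{minor}")
if (major, minor) < (8, 0):
    print("flash-attn 2.x will raise 'FlashAttention only supports Ampere GPUs "
          "or newer' here, regardless of dtype.")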

SparkJiao avatar Apr 09 '25 08:04 SparkJiao

Thank you @SparkJiao, but when I set 'fp16', I get this error:

(WorkerDict pid=1474051) /home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 7x across cluster]
(WorkerDict pid=1474051)   warnings.warn( [repeated 7x across cluster]
(WorkerDict pid=1473743) LLVM ERROR: Failed to compute parent layout for slice layout.
(WorkerDict pid=1473743) *** SIGABRT received at time=1744180215 on cpu 46 ***
(WorkerDict pid=1473743) PC: @     0x7fd494c5081b  (unknown)  raise
(WorkerDict pid=1473743)     @     0x7fd494f6f5a0   54596288  (unknown)
(WorkerDict pid=1473743)     @                0x2  (unknown)  (unknown)
(WorkerDict pid=1473743)     @     0x7fa4492eba30  (unknown)  (unknown)
(WorkerDict pid=1473743) [2025-04-09 14:30:15,691 E 1473743 1473743] logging.cc:497: *** SIGABRT received at time=1744180215 on cpu 46 ***
(WorkerDict pid=1473743) [2025-04-09 14:30:15,691 E 1473743 1473743] logging.cc:497: PC: @     0x7fd494c5081b  (unknown)  raise
(WorkerDict pid=1474041) [2025-04-09 14:30:15,694 E 1474041 1474041] logging.cc:497:     @     0x7fb555e085a0  (unknown)  (unknown)
(WorkerDict pid=1473743) [2025-04-09 14:30:15,693 E 1473743 1473743] logging.cc:497:     @     0x7fd494f6f5a0   54596288  (unknown)
(WorkerDict pid=1473743) Fatal Python error: Aborted
(WorkerDict pid=1473743) 
(WorkerDict pid=1473743) Stack (most recent call first):
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 286 in make_llir
(WorkerDict pid=1473743)   File "/home/miniconda3/envs/verl/lib/python3.10/site-packages/triton/backends/nvidia/compiler.py", line 387 in <lambda>
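(WorkerDict pid=1473743) (For what it's worth, the stack above ends inside triton/backends/nvidia/compiler.py, so this fp16 crash appears to be a Triton kernel failing to compile for sm_70 rather than FlashAttention itself.)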

lot-insts avatar Apr 09 '25 08:04 lot-insts

Or, how can I disable FlashAttention across the whole project?

lot-insts avatar Apr 09 '25 08:04 lot-insts

Have you solved it?

clilyn1234 avatar Apr 12 '25 14:04 clilyn1234

Have you solved it?

No, not yet.

lot-insts avatar Apr 17 '25 01:04 lot-insts

https://github.com/volcengine/verl/issues/252

yuchenwang3 avatar Apr 22 '25 21:04 yuchenwang3

@lot-insts, is this problem solved? I had the same problem on a V100.

(WorkerDict pid=1474051) /home/miniconda3/envs/verl/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict . Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html . [repeated 7x across cluster]
(WorkerDict pid=1474051)   warnings.warn( [repeated 7x across cluster]
(WorkerDict pid=1473743) LLVM ERROR: Failed to compute parent layout for slice layout.
(WorkerDict pid=1473743) *** SIGABRT received at time=1744180215 on cpu 46 ***
(WorkerDict pid=1473743) PC: @     0x7fd494c5081b  (unknown)  raise

mengjie09 avatar May 23 '25 07:05 mengjie09

Same issue

Huangsz2021 avatar May 28 '25 11:05 Huangsz2021

@mengjie09 I solved this. The following steps worked for me, so you can refer to them.

Step 1: Comment out all calls to flash_attention

  • File: /verl/verl/workers/fsdp_workers.py

Explicitly disable the following options:

            enable_chunked_prefill=False,
            enable_prefix_caching=False,

[screenshots of the edited code]

  • File: /verl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py

There are multiple occurrences. The above is just one example.
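
If you'd rather not patch verl's sources, a possibly simpler route (my own sketch, not something verl documents) is to load the HF model with a non-flash attention backend. Note that the traceback above goes through verl's _ulysses_flash_attention_forward monkey patch, which is tied to actor_rollout_ref.model.use_remove_padding=True, so that option likely needs to be set to False as well:

import torch
from transformers import AutoModelForCausalLM

# Load Qwen2.5 with PyTorch's native SDPA attention instead of flash-attn;
# both "sdpa" and "eager" run on pre-Ampere GPUs such as the V100.
model = AutoModelForCausalLM.from_pretrained(
    "/huggingface/Qwen2.5-0.5B",
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
)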

Step 2: Install a version of the flash-attn package compatible with sm_70

  • flash-attn ≤ 1.0.8 does not include the bf16 cross-entropy kernel, so it can compile successfully on SM 70; also, flash_attn.bert_padding is a pure Python module, so its helpers do not require the GPU kernels even if compilation fails.
# 1. First, uninstall any previously installed (possibly auto-installed) newer versions
#    (Skip this step if already done)

pip uninstall -y flash-attn flash_attn || true

# 2. Install version 1.0.8 and specify only the compute capability 70 architecture;
#    This version only includes fp16 kernels and will not trigger bf16 ones

export TORCH_CUDA_ARCH_LIST="70"
pip install flash-attn==1.0.8 --no-build-isolation \
--config-settings="--install-option=--cuda_arch_list=70"

# 3. When launching your script, keep the existing environment variables, but **make sure to add**:

export FLASH_ATTENTION_FORCE_DISABLED=1  # Prevent verl from calling other kernels
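
After the install, a quick sanity check (a sketch; flash_attn.__version__ and the bert_padding helpers are standard in the 1.x releases):

import flash_attn
from flash_attn.bert_padding import pad_input, unpad_input  # pure Python, no GPU kernel needed

print("flash-attn version:", flash_attn.__version__)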

lot-insts avatar Jun 05 '25 02:06 lot-insts

pip uninstall -y flash-attn flash_attn || true

After following your method, I encountered an error: ImportError: /home/miniconda3/envs/llm_torch26/lib/python3.10/site-packages/flash_attn_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs. Do you know how to solve this?
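
In case it helps diagnose: that symbol demangles to c10::Error::Error(c10::SourceLocation, std::string), so I suspect the extension was built against a different PyTorch / C++ ABI than the one in my environment. A quick ABI check (a sketch, using only PyTorch):

import torch

# If this prints True, torch uses the new C++11 ABI; an extension built with the
# old ABI (the plain 'Ss' std::string in the missing symbol) will fail to link.
print("torch:", torch.__version__, "| cxx11 ABI:", torch.compiled_with_cxx11_abi())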

Wzh10032 avatar Jun 20 '25 15:06 Wzh10032

pip uninstall -y flash-attn flash_attn || true

After following your method, I encountered an error: ImportError: /home/miniconda3/envs/llm_torch26/lib/python3.10/site-packages/flash_attn_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs. Do you know how to solve this?

Has this been solved? I ran into the same problem too.

wonderNefelibata avatar Aug 22 '25 05:08 wonderNefelibata