
GRPO training on Qwen3-VL-32B-Instruct raises an IndexError

Open Qiao0124 opened this issue 1 month ago • 4 comments

Hardware: 2 nodes × 8 H20 GPUs (16 GPUs in total)

Command:

nnodes=2
nproc_per_node=8
export CUDA_LAUNCH_BLOCKING=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

MODEL_PATH=$(readlink -f output/models/Qwen3-VL-32B-Instruct/sft)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=$nproc_per_node \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
NNODES=$nnodes \
swift rlhf \
    --rlhf_type grpo \
    --model $MODEL_PATH \
    --external_plugins train/plugin.py \
    --multi_turn_scheduler d2c_scheduler \
    --reward_funcs d2c_reward \
    --use_vllm true \
    --vllm_mode colocate \
    --sleep_level 1 \
    --offload_optimizer true \
    --offload_model true \
    --train_type full \
    --torch_dtype bfloat16 \
    --dataset 'train/data/grpo_train.jsonl' \
    --load_from_cache_file true \
    --split_dataset_ratio 0.01 \
    --max_completion_length 32768 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 1 \
    --save_strategy 'steps' \
    --eval_strategy 'steps' \
    --eval_steps 100 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 1 \
    --output_dir output/models/Qwen3-VL-32B-Instruct/grpo \
    --warmup_ratio 0.01 \
    --dataloader_num_workers 1 \
    --num_generations 16 \
    --temperature 1.0 \
    --deepspeed zero3 \
    --move_model_batches 8 \
    --add_version false \
    --create_checkpoint_symlink true \
    --log_completions true 

Error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/cli/rlhf.py", line 7, in <module>
[rank0]:     rlhf_main()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/train/rlhf.py", line 233, in rlhf_main
[rank0]:     return SwiftRLHF(args).main()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
[rank0]:     result = self.run()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank0]:     return func(self, *args, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/train/sft.py", line 197, in run
[rank0]:     trainer = trainer_cls(
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/grpo_trainer.py", line 84, in __init__
[rank0]:     self.prepare_rollout()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rollout_mixin.py", line 86, in prepare_rollout
[rank0]:     self._prepare_vllm()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rollout_mixin.py", line 168, in _prepare_vllm
[rank0]:     self.engine = self._prepare_vllm_engine()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rollout_mixin.py", line 212, in _prepare_vllm_engine
[rank0]:     engine = GRPOVllmEngine(
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/infer/infer_engine/grpo_vllm_engine.py", line 61, in __init__
[rank0]:     super().__init__(
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/infer/infer_engine/vllm_engine.py", line 140, in __init__
[rank0]:     self._prepare_engine()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/infer/infer_engine/vllm_engine.py", line 150, in _prepare_engine
[rank0]:     engine = llm_engine_cls.from_engine_args(self.engine_args)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 510, in from_engine_args
[rank0]:     return engine_cls.from_vllm_config(
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 112, in from_vllm_config
[rank0]:     return cls(vllm_config=vllm_config,
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 92, in __init__
[rank0]:     self.engine_core = EngineCoreClient.make_client(
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 75, in make_client
[rank0]:     return InprocClient(vllm_config, executor_class, log_stats)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 198, in __init__
[rank0]:     self.engine_core = EngineCore(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 64, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 122, in _init_executor
[rank0]:     self.collective_rpc("load_model")
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/utils.py", line 2456, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 162, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1332, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 452, in load_model
[rank0]:     model = _initialize_model(vllm_config=vllm_config)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]:     return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 151, in __init__
[rank0]:     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 377, in __init__
[rank0]:     self.model = TransformersModel(vllm_config=vllm_config, prefix=prefix)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 143, in __init__
[rank0]:     self.model: PreTrainedModel = AutoModel.from_config(
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 456, in from_config
[rank0]:     return model_class._from_config(config, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2308, in _from_config
[rank0]:     model = cls(config, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank0]:     f(module, *args, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 898, in __init__
[rank0]:     self.language_model = Qwen3VLTextModel._from_config(config.text_config)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2311, in _from_config
[rank0]:     model = cls(config, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank0]:     f(module, *args, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 780, in __init__
[rank0]:     self.post_init()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2111, in post_init
[rank0]:     self.init_weights()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3651, in init_weights
[rank0]:     self.initialize_weights()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2984, in initialize_weights
[rank0]:     self.smart_apply(self._initialize_weights)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2977, in smart_apply
[rank0]:     module.smart_apply(fn)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2978, in smart_apply
[rank0]:     fn(self)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2952, in _initialize_weights
[rank0]:     self._init_weights(module)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2929, in _init_weights
[rank0]:     module.weight.data[module.padding_idx].zero_()
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]:     return func(*args, **kwargs)
[rank0]: IndexError: index 151643 is out of bounds for dimension 0 with size 0
[rank0]:[W1118 15:22:36.227975182 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
terminate called after throwing an instance of 'c10::Error'
  what():  Trying to free a pointer not allocated here
Exception raised from raw_delete at /workspace/code/torch260/pytorch/torch/csrc/cuda/CUDAPluggableAllocator.cpp:151 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2f3b195206 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x60 (0x7f2f3b13e805 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::CUDAPluggableAllocator::CUDAPluggableAllocator::raw_delete(void*) + 0x237 (0x7f2ee661d997 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x1f966 (0x7f2f41b75966 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2005e (0x7f2f41b7605e in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x3849f (0x7f2f41b8e49f in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: c10::cuda::MemPool::~MemPool() + 0x1b2 (0x7f2f41b77bb2 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #7: <unknown function> + 0xce025a (0x7f2f3a2e025a in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x4519d0 (0x7f2f39a519d0 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x452011 (0x7f2f39a52011 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: /checkpoint/binary/train_package/python_bin() [0x486e02]
frame #11: /checkpoint/binary/train_package/python_bin() [0x481adb]
frame #12: /checkpoint/binary/train_package/python_bin() [0x45d80d]
frame #13: /checkpoint/binary/train_package/python_bin() [0x45e868]
frame #14: /checkpoint/binary/train_package/python_bin() [0x45e7ad]
frame #15: /checkpoint/binary/train_package/python_bin() [0x486f68]
frame #16: /checkpoint/binary/train_package/python_bin() [0x45d80d]
frame #17: /checkpoint/binary/train_package/python_bin() [0x48883b]
frame #18: /checkpoint/binary/train_package/python_bin() [0x551ebe]
frame #19: /checkpoint/binary/train_package/python_bin() [0x552aea]
frame #20: /checkpoint/binary/train_package/python_bin() [0x5293ee]
frame #21: Py_BytesMain + 0x5f (0x42942f in /checkpoint/binary/train_package/python_bin)
frame #22: __libc_start_main + 0xf2 (0x7f2f4821ba72 in /lib64/libc.so.6)
frame #23: _start + 0x2e (0x42808e in /checkpoint/binary/train_package/python_bin)

W1118 15:22:38.115000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1783 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1784 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1785 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1786 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1787 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1788 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1789 closing signal SIGTERM

E1118 15:22:47.058000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 7 (pid: 1790) of binary: /checkpoint/binary/train_package/python_bin
Traceback (most recent call last):
  File "/opt/conda/envs/python3.10.13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/python3.10.13/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
    main()
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-11-18_15:22:38
  host      : xdl-2e1ef101e5c5-worker-0
  rank      : 7 (local_rank: 7)
  exitcode  : -11 (pid: 1790)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 1790
============================================================
--------------------------------------------------------------------------------

All 8 ranks report the same error: IndexError: index 151643 is out of bounds for dimension 0 with size 0.

Changes already tried:

  • Removing move_model_batches: same error
  • Removing sleep_level, offload_optimizer, and offload_model: same error
  • Changing zero3 to zero2: out-of-memory (OOM) error

Why does this IndexError occur, and how can it be fixed?
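For context, the line that fails in the traceback is transformers' _init_weights zeroing the padding row of an embedding whose weight has already been turned into a DeepSpeed ZeRO-3 size-0 placeholder (note the partition_parameters.py frames above). Below is a minimal sketch of that mechanism only, assuming nothing beyond the traceback itself — the index 151643 and the size-0 weight are taken from the error message; this is not ms-swift or vLLM code:

import torch

# Under ZeRO-3 init, parameters are kept as size-0 placeholders until gathered,
# but transformers' _init_weights still runs
# module.weight.data[module.padding_idx].zero_() on them.
weight = torch.empty(0, 0)   # stands in for the partitioned embedding weight
padding_idx = 151643         # token id reported in the traceback
weight[padding_idx].zero_()  # IndexError: index 151643 is out of bounds for dimension 0 with size 0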

Qiao0124 avatar Nov 18 '25 07:11 Qiao0124

plz provide env infos

hjh0119 avatar Nov 18 '25 07:11 hjh0119

plz provide env infos

OS: Linux 5.10.134-010.ali5000.al8.x86_64
Python version: 3.10.13+gc (heads/release/3.10.13-inc_gc:866f61ca61, Jun 13 2025, 02:25:33) [GCC 13.3.1 20240611 (Red Hat 13.3.1-2)]
NVCC version: Cuda compilation tools, release 12.8, V12.8.61
PyTorch CUDA version: 12.8
PyTorch cuDNN version: 91002
Available GPUs: 8 × NVIDIA H20 (GPU 0–7)

ms-swift 3.10.1
transformers 4.57.1
accelerate 1.7.0
peft 0.17.1
torch 2.6.0
torchvision 0.21.0
torchaudio 2.6.0
deepspeed 0.14.5
numpy 1.26.4
pandas 2.3.0
scikit-learn 1.7.0
huggingface-hub 0.36.0
datasets 3.6.0
modelscope 1.31.0
opencv-python 4.8.0.74
vLLM 0.8.5.post1+cu128

Qiao0124 avatar Nov 18 '25 08:11 Qiao0124

vLLM 0.8.5.post1+cu128

Qwen3-VL models are supported starting from vLLM 0.11.0, so the installed 0.8.5.post1 is too old.
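A quick way to confirm whether the installed vLLM is new enough before relaunching the job — a hedged sketch that only inspects the version string, assuming the packaging module is importable in the training environment:

import vllm
from packaging.version import Version

# Qwen3-VL is supported natively from vLLM 0.11.0. Older versions fall back to the
# generic "transformers" backend (see vllm/model_executor/models/transformers.py in
# the traceback), which is what ends up tripping over the ZeRO-3 size-0 parameters.
installed = Version(vllm.__version__)
print(f"vLLM {installed}")
if installed < Version("0.11.0"):
    print("Upgrade vLLM, e.g. pip install -U 'vllm>=0.11.0', before running GRPO on Qwen3-VL")

After upgrading, the same swift rlhf command should load the model through vLLM's native Qwen3-VL implementation rather than the transformers fallback shown in the traceback.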

hjh0119 avatar Nov 18 '25 08:11 hjh0119

vLLM 0.8.5.post1+cu128

Qwen3-VL models are supported starting from vLLM 0.11.0, so the installed 0.8.5.post1 is too old.

Thanks! I'll try that.

Qiao0124 avatar Nov 18 '25 08:11 Qiao0124