IndexError when running GRPO on Qwen3-VL-32B-Instruct
Hardware: 2 nodes, 16× NVIDIA H20 GPUs (8 per node)
Command:
nnodes=2
nproc_per_node=8
export CUDA_LAUNCH_BLOCKING=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
MODEL_PATH=$(readlink -f output/models/Qwen3-VL-32B-Instruct/sft)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=$nproc_per_node \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
NNODES=$nnodes \
swift rlhf \
--rlhf_type grpo \
--model $MODEL_PATH \
--external_plugins train/plugin.py \
--multi_turn_scheduler d2c_scheduler \
--reward_funcs d2c_reward \
--use_vllm true \
--vllm_mode colocate \
--sleep_level 1 \
--offload_optimizer true \
--offload_model true \
--train_type full \
--torch_dtype bfloat16 \
--dataset 'train/data/grpo_train.jsonl' \
--load_from_cache_file true \
--split_dataset_ratio 0.01 \
--max_completion_length 32768 \
--num_train_epochs 2 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 1 \
--save_strategy 'steps' \
--eval_strategy 'steps' \
--eval_steps 100 \
--save_steps 1000 \
--save_total_limit 10 \
--logging_steps 1 \
--output_dir output/models/Qwen3-VL-32B-Instruct/grpo \
--warmup_ratio 0.01 \
--dataloader_num_workers 1 \
--num_generations 16 \
--temperature 1.0 \
--deepspeed zero3 \
--move_model_batches 8 \
--add_version false \
--create_checkpoint_symlink true \
--log_completions true
Error:
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/cli/rlhf.py", line 7, in <module>
[rank0]: rlhf_main()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/train/rlhf.py", line 233, in rlhf_main
[rank0]: return SwiftRLHF(args).main()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/ray/base.py", line 170, in wrapper
[rank0]: return func(self, *args, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/train/sft.py", line 197, in run
[rank0]: trainer = trainer_cls(
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/grpo_trainer.py", line 84, in __init__
[rank0]: self.prepare_rollout()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rollout_mixin.py", line 86, in prepare_rollout
[rank0]: self._prepare_vllm()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rollout_mixin.py", line 168, in _prepare_vllm
[rank0]: self.engine = self._prepare_vllm_engine()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/trainers/rlhf_trainer/rollout_mixin.py", line 212, in _prepare_vllm_engine
[rank0]: engine = GRPOVllmEngine(
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/infer/infer_engine/grpo_vllm_engine.py", line 61, in __init__
[rank0]: super().__init__(
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/infer/infer_engine/vllm_engine.py", line 140, in __init__
[rank0]: self._prepare_engine()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/llm/infer/infer_engine/vllm_engine.py", line 150, in _prepare_engine
[rank0]: engine = llm_engine_cls.from_engine_args(self.engine_args)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 510, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 112, in from_vllm_config
[rank0]: return cls(vllm_config=vllm_config,
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/llm_engine.py", line 92, in __init__
[rank0]: self.engine_core = EngineCoreClient.make_client(
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 75, in make_client
[rank0]: return InprocClient(vllm_config, executor_class, log_stats)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 198, in __init__
[rank0]: self.engine_core = EngineCore(*args, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 64, in __init__
[rank0]: self.model_executor = executor_class(vllm_config)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 122, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/utils.py", line 2456, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 162, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1332, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 452, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 151, in __init__
[rank0]: old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 377, in __init__
[rank0]: self.model = TransformersModel(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/vllm/model_executor/models/transformers.py", line 143, in __init__
[rank0]: self.model: PreTrainedModel = AutoModel.from_config(
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 456, in from_config
[rank0]: return model_class._from_config(config, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2308, in _from_config
[rank0]: model = cls(config, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank0]: f(module, *args, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 898, in __init__
[rank0]: self.language_model = Qwen3VLTextModel._from_config(config.text_config)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 277, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2311, in _from_config
[rank0]: model = cls(config, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank0]: f(module, *args, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/models/qwen3_vl/modeling_qwen3_vl.py", line 780, in __init__
[rank0]: self.post_init()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2111, in post_init
[rank0]: self.init_weights()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3651, in init_weights
[rank0]: self.initialize_weights()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2984, in initialize_weights
[rank0]: self.smart_apply(self._initialize_weights)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2977, in smart_apply
[rank0]: module.smart_apply(fn)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2978, in smart_apply
[rank0]: fn(self)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2952, in _initialize_weights
[rank0]: self._init_weights(module)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2929, in _init_weights
[rank0]: module.weight.data[module.padding_idx].zero_()
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/utils/_device.py", line 104, in __torch_function__
[rank0]: return func(*args, **kwargs)
[rank0]: IndexError: index 151643 is out of bounds for dimension 0 with size 0
[rank0]:[W1118 15:22:36.227975182 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
terminate called after throwing an instance of 'c10::Error'
what(): Trying to free a pointer not allocated here
Exception raised from raw_delete at /workspace/code/torch260/pytorch/torch/csrc/cuda/CUDAPluggableAllocator.cpp:151 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2f3b195206 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x60 (0x7f2f3b13e805 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: torch::cuda::CUDAPluggableAllocator::CUDAPluggableAllocator::raw_delete(void*) + 0x237 (0x7f2ee661d997 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0x1f966 (0x7f2f41b75966 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x2005e (0x7f2f41b7605e in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x3849f (0x7f2f41b8e49f in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: c10::cuda::MemPool::~MemPool() + 0x1b2 (0x7f2f41b77bb2 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #7: <unknown function> + 0xce025a (0x7f2f3a2e025a in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x4519d0 (0x7f2f39a519d0 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x452011 (0x7f2f39a52011 in /opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: /checkpoint/binary/train_package/python_bin() [0x486e02]
frame #11: /checkpoint/binary/train_package/python_bin() [0x481adb]
frame #12: /checkpoint/binary/train_package/python_bin() [0x45d80d]
frame #13: /checkpoint/binary/train_package/python_bin() [0x45e868]
frame #14: /checkpoint/binary/train_package/python_bin() [0x45e7ad]
frame #15: /checkpoint/binary/train_package/python_bin() [0x486f68]
frame #16: /checkpoint/binary/train_package/python_bin() [0x45d80d]
frame #17: /checkpoint/binary/train_package/python_bin() [0x48883b]
frame #18: /checkpoint/binary/train_package/python_bin() [0x551ebe]
frame #19: /checkpoint/binary/train_package/python_bin() [0x552aea]
frame #20: /checkpoint/binary/train_package/python_bin() [0x5293ee]
frame #21: Py_BytesMain + 0x5f (0x42942f in /checkpoint/binary/train_package/python_bin)
frame #22: __libc_start_main + 0xf2 (0x7f2f4821ba72 in /lib64/libc.so.6)
frame #23: _start + 0x2e (0x42808e in /checkpoint/binary/train_package/python_bin)
W1118 15:22:38.115000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1783 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1784 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1785 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1786 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1787 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1788 closing signal SIGTERM
W1118 15:22:38.116000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1789 closing signal SIGTERM
E1118 15:22:47.058000 1714 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 7 (pid: 1790) of binary: /checkpoint/binary/train_package/python_bin
Traceback (most recent call last):
File "/opt/conda/envs/python3.10.13/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/python3.10.13/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in <module>
main()
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/conda/envs/python3.10.13/lib/python3.10/site-packages/swift/cli/rlhf.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-11-18_15:22:38
host : xdl-2e1ef101e5c5-worker-0
rank : 7 (local_rank: 7)
exitcode : -11 (pid: 1790)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 1790
============================================================
--------------------------------------------------------------------------------
All 8 ranks raise the same IndexError: index 151643 is out of bounds for dimension 0 with size 0.
Changes already tried:
- Removing move_model_batches: same error
- Removing sleep_level, offload_optimizer and offload_model: same error
- Changing zero3 to zero2: OOM instead
Why does this IndexError occur, and how can it be fixed? A minimal sketch of the failure mode is below.
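From the traceback, the failing line is transformers' `_init_weights`, `module.weight.data[module.padding_idx].zero_()`, and it runs inside DeepSpeed's zero.Init wrapper (partition_parameters.py), which partitions parameters at construction time. On this rank the embedding weight is an empty local shard, so indexing row 151643 (the padding token id) fails. A minimal sketch of that failure mode, assuming only that the local shard is empty:

```python
import torch

# Under ZeRO-3, a partitioned parameter's local shard can hold 0 elements;
# the full weight lives sharded across ranks. Model this with an empty tensor.
weight = torch.empty(0, 4096)

padding_idx = 151643  # the token id from the error message

# The same operation transformers' _init_weights performs on an Embedding
# that has padding_idx set:
weight[padding_idx].zero_()
# -> IndexError: index 151643 is out of bounds for dimension 0 with size 0
```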
Please provide your environment info.
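For example, something like this quick dump (a sketch; it only assumes the relevant packages are importable):

```python
# Print the versions that matter for this issue.
import torch
import transformers
import deepspeed
import vllm

print("torch        :", torch.__version__)
print("transformers :", transformers.__version__)
print("deepspeed    :", deepspeed.__version__)
print("vllm         :", vllm.__version__)
print("CUDA (torch) :", torch.version.cuda)
print("GPU count    :", torch.cuda.device_count())
```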
OS: Linux 5.10.134-010.ali5000.al8.x86_64
Python: 3.10.13+gc (heads/release/3.10.13-inc_gc:866f61ca61, Jun 13 2025, 02:25:33) [GCC 13.3.1 20240611 (Red Hat 13.3.1-2)]
NVCC: Cuda compilation tools, release 12.8, V12.8.61
PyTorch CUDA: 12.8
PyTorch cuDNN: 91002
GPUs: 8 available, GPU 0-7 all NVIDIA H20
ms-swift 3.10.1
transformers 4.57.1
accelerate 1.7.0
peft 0.17.1
torch 2.6.0
torchvision 0.21.0
torchaudio 2.6.0
deepspeed 0.14.5
numpy 1.26.4
pandas 2.3.0
scikit-learn 1.7.0
huggingface-hub 0.36.0
datasets 3.6.0
modelscope 1.31.0
opencv-python 4.8.0.74
vLLM 0.8.5.post1+cu128
vLLM 0.8.5.post1+cu128
Qwen3-VL models are only supported natively starting from vLLM 0.11.0. With 0.8.5, vLLM falls back to its generic Transformers backend, which is exactly the path in your traceback (vllm/model_executor/models/transformers.py), so upgrading vLLM should fix this.
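If you want to verify which backend your installed vLLM uses for this model, you can check its model registry (a sketch; `ModelRegistry.get_supported_archs()` is the registry API in recent vLLM releases, treat the exact import path as an assumption for your version):

```python
# Check whether the installed vLLM registers Qwen3-VL natively; if it does
# not, vLLM falls back to the generic Transformers backend seen in the
# traceback (vllm/model_executor/models/transformers.py).
from vllm.model_executor.models import ModelRegistry

archs = ModelRegistry.get_supported_archs()
print("Qwen3VLForConditionalGeneration" in archs)
# Expected: False on 0.8.5.post1, True on >= 0.11.0
```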
Thanks! I'll try that.