
AttributeError: 'Tensor' object has no attribute 'row'

Open singing4you opened this issue 4 months ago • 2 comments

The error output is as follows:

[rank3]:[I1118 16:21:40.081921327 ProcessGroupWrapper.cpp:586] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=1, OpType=ALLGATHER, TensorShape=[348031], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I1118 16:21:40.082491969 ProcessGroupWrapper.cpp:586] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=1, OpType=ALLGATHER, TensorShape=[348031], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I1118 16:21:40.082541750 ProcessGroupWrapper.cpp:586] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=1, OpType=ALLGATHER, TensorShape=[348031], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
Warning: The current version of the file storing weights is old, and it is relanded due to internal bug of torch and compatibility issue. We will deprecate the loading support for this type of file in the future, please use newer torch to re-store the weight file.
/home/ma-user/anaconda3/envs/vllm_infer/lib/python3.10/site-packages/torch_npu/utils/storage.py:88: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if self.device.type != 'cpu':
[rank3]:[I1118 16:21:41.200706571 ProcessGroupWrapper.cpp:586] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=6561, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I1118 16:21:41.205902164 ProcessGroupWrapper.cpp:586] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=6561, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank2]:[I1118 16:21:41.214111945 ProcessGroupWrapper.cpp:586] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=6561, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank5]:[I1118 16:21:41.221912069 ProcessGroupWrapper.cpp:586] [Rank 5] Running collective: CollectiveFingerPrint(SequenceNumber=6561, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank4]:[I1118 16:21:41.234877395 ProcessGroupWrapper.cpp:586] [Rank 4] Running collective: CollectiveFingerPrint(SequenceNumber=6561, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank6]:[I1118 16:21:41.242638628 ProcessGroupWrapper.cpp:586] [Rank 6] Running collective: CollectiveFingerPrint(SequenceNumber=6561, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I1118 16:21:41.250905371 ProcessGroupWrapper.cpp:586] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=6561, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank7]:[I1118 16:21:41.280036085 ProcessGroupWrapper.cpp:586] [Rank 7] Running collective: CollectiveFingerPrint(SequenceNumber=6561, OpType=ALLGATHER, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank5]:[I1118 16:21:41.385560215 ProcessGroupWrapper.cpp:586] [Rank 5] Running collective: CollectiveFingerPrint(SequenceNumber=6562, OpType=ALLGATHER, TensorShape=[88], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank3]:[I1118 16:21:41.387166107 ProcessGroupWrapper.cpp:586] [Rank 3] Running collective: CollectiveFingerPrint(SequenceNumber=6562, OpType=ALLGATHER, TensorShape=[88], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank2]:[I1118 16:21:41.388811609 ProcessGroupWrapper.cpp:586] [Rank 2] Running collective: CollectiveFingerPrint(SequenceNumber=6562, OpType=ALLGATHER, TensorShape=[88], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I1118 16:21:41.389791488 ProcessGroupWrapper.cpp:586] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=6562, OpType=ALLGATHER, TensorShape=[88], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I1118 16:21:41.393597964 ProcessGroupWrapper.cpp:586] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=6562, OpType=ALLGATHER, TensorShape=[88], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank6]:[I1118 16:21:41.398648173 ProcessGroupWrapper.cpp:586] [Rank 6] Running collective: CollectiveFingerPrint(SequenceNumber=6562, OpType=ALLGATHER, TensorShape=[88], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank4]:[I1118 16:21:41.399400048 ProcessGroupWrapper.cpp:586] [Rank 4] Running collective: CollectiveFingerPrint(SequenceNumber=6562, OpType=ALLGATHER, TensorShape=[88], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank7]:[I1118 16:21:41.401116242 ProcessGroupWrapper.cpp:586] [Rank 7] Running collective: CollectiveFingerPrint(SequenceNumber=6562, OpType=ALLGATHER, TensorShape=[88], TensorDtypes=Byte, TensorDeviceTypes=TensorOptions(dtype=float (default), device=npu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[INFO:swift] last_model_checkpoint: None
[INFO:swift] best_model_checkpoint: None
[INFO:swift] images_dir: /home/ma-user/modelarts/outputs/SAVE_URL_0/v0-20251118-160819/images
[rank3]: Traceback (most recent call last):
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/cli/rlhf.py", line 5, in <module>
[rank3]:     rlhf_main()
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/llm/train/rlhf.py", line 217, in rlhf_main
[rank3]:     return SwiftRLHF(args).main()
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/llm/base.py", line 49, in main
[rank3]:     result = self.run()
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/llm/train/sft.py", line 195, in run
[rank3]:     return self.train(trainer)
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/llm/train/sft.py", line 243, in train
[rank3]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/mixin.py", line 794, in train
[rank3]:     res = super().train(*args, **kwargs)
[rank3]:   File "/home/ma-user/anaconda3/envs/vllm_infer/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in train
[rank3]:     return inner_training_loop(
[rank3]:   File "/home/ma-user/anaconda3/envs/vllm_infer/lib/python3.10/site-packages/transformers/trainer.py", line 2672, in _inner_training_loop
[rank3]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 2013, in training_step
[rank3]:     return super().training_step(model, inputs, num_items_in_batch)
[rank3]:   File "/home/ma-user/anaconda3/envs/vllm_infer/lib/python3.10/site-packages/transformers/trainer.py", line 4003, in training_step
[rank3]:     inputs = self._prepare_inputs(inputs)
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/utils.py", line 170, in wrapper
[rank3]:     return func(self, *args, **kwargs)
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 436, in _prepare_inputs
[rank3]:     generation_batch = self._generate_and_score_completions(generation_batch)
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/utils.py", line 170, in wrapper
[rank3]:     return func(self, *args, **kwargs)
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 862, in _generate_and_score_completions
[rank3]:     batch_encoded_inputs = self._prepare_batch_inputs(inputs)
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/utils.py", line 170, in wrapper
[rank3]:     return func(self, *args, **kwargs)
[rank3]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1354, in _prepare_batch_inputs
[rank3]:     self._metrics[mode]['completions/mean_length'].append(total_lengths.mean().row())
[rank3]: AttributeError: 'Tensor' object has no attribute 'row'
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/cli/rlhf.py", line 5, in <module>
[rank1]:     rlhf_main()
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/llm/train/rlhf.py", line 217, in rlhf_main
[rank1]:     return SwiftRLHF(args).main()
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/llm/base.py", line 49, in main
[rank1]:     result = self.run()
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/llm/train/sft.py", line 195, in run
[rank1]:     return self.train(trainer)
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/llm/train/sft.py", line 243, in train
[rank1]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/mixin.py", line 794, in train
[rank1]:     res = super().train(*args, **kwargs)
[rank1]:   File "/home/ma-user/anaconda3/envs/vllm_infer/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in train
[rank1]:     return inner_training_loop(
[rank1]:   File "/home/ma-user/anaconda3/envs/vllm_infer/lib/python3.10/site-packages/transformers/trainer.py", line 2672, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 2013, in training_step
[rank1]:     return super().training_step(model, inputs, num_items_in_batch)
[rank1]:   File "/home/ma-user/anaconda3/envs/vllm_infer/lib/python3.10/site-packages/transformers/trainer.py", line 4003, in training_step
[rank1]:     inputs = self._prepare_inputs(inputs)
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/utils.py", line 170, in wrapper
[rank1]:     return func(self, *args, **kwargs)
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 436, in _prepare_inputs
[rank1]:     generation_batch = self._generate_and_score_completions(generation_batch)
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/utils.py", line 170, in wrapper
[rank1]:     return func(self, *args, **kwargs)
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 862, in _generate_and_score_completions
[rank1]:     batch_encoded_inputs = self._prepare_batch_inputs(inputs)
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/utils.py", line 170, in wrapper
[rank1]:     return func(self, *args, **kwargs)
[rank1]:   File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1354, in _prepare_batch_inputs
[rank1]:     self._metrics[mode]['completions/mean_length'].append(total_lengths.mean().row())
[rank1]: AttributeError: 'Tensor' object has no attribute 'row'

Hardware: Ascend-910B4; Python: 3.10; torch/torch_npu: 2.8.0/2.8.0rc1

Training in colocate mode fails with the error above. The command below is copied from the log:

/home/ma-user/anaconda3/envs/vllm_infer/bin/python3.1 -m torch.distributed.run  \
    --nproc_per_node 8 \
    --master_port 1234 \
    --nnodes 16 \
    --node_rank 0 \
    --master_addr ma-job-2cda15cd-9fed-4207-a22c-8a4b82ae69d3-worker-0.ma-job-2cda15cd-9fed-4207-a22c-8a4b82ae69d3 \
    /home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/cli/rlhf.py \
    --attn_impl flash_attn \
    --beta 0 \
    --dataloader_num_workers 4 \
    --deepspeed zero3 \
    --dataset /home/ma-user/modelarts/inputs/DATA_URL_0 \
    --dataset_shuffle True \
    --dynamic_sample false \
    --epsilon_high 0.28 \
    --external_plugins examples/train/grpo/plugin/spn_agent.py \
    --generation_batch_size 384 \
    --gradient_accumulation_steps 3 \
    --importance_sampling_level token \
    --learning_rate 3e-4 \
    --log_completions true \
    --log_entropy true \
    --logging_steps 1 \
    --loss_type bnpo \
    --lr_scheduler_type cosine \
    --max_completion_length 2048 \
    --model /home/ma-user/modelarts/inputs/MODEL_URL_1 \
    --model_type qwq \
    --move_model_batches 32 \
    --multi_turn_scheduler tool_call_scheduler \
    --num_train_epochs 50 \
    --num_generations 8 \
    --num_iterations 1 \
    --output_dir /home/ma-user/modelarts/outputs/SAVE_URL_0 \
    --overlong_filter true \
    --padding_free true \
    --per_device_eval_batch_size 1 \
    --per_device_train_batch_size 1 \
    --repetition_penalty 1 \
    --reward_funcs external_single_turn_spn_agent_reward \
    --reward_weights 1.0 \
    --rlhf_type grpo \
    --save_steps 1 \
    --save_total_limit 100 \
    --sequence_parallel_size 4 \
    --temperature 0.8 \
    --torch_dtype bfloat16 \
    --train_type lora \
    --use_vllm true \
    --vllm_mode colocate \
    --sleep_level 1 \
    --offload_optimizer true \
    --offload_model false \
    --vllm_tensor_parallel_size 8 \
    --vllm_gpu_memory_utilization 0.7 \
    --vllm_mm_processor_cache_gb 0 \
    --vllm_max_model_len 32768 \
    --vllm_enable_prefix_caching True \
    --warmup_ratio 0 \
    --max_turns 10

singing4you avatar Nov 18 '25 08:11 singing4you

File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1354, in _prepare_batch_inputs
[rank1]:     self._metrics[mode]['completions/mean_length'].append(total_lengths.mean().row())

Did you modify the source code?

https://github.com/modelscope/ms-swift/blob/v3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py#L1354

hjh0119 avatar Nov 18 '25 08:11 hjh0119
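
One way to answer the question above without reading the whole file is to scan the local copy for the method call named in the traceback. The following is a minimal sketch using only the standard library. The helper name `find_method_calls` and the two-line `sample` string are hypothetical and exist only for illustration. The `.item()` line in the sample is an assumption about what such a metric line usually looks like in PyTorch code, not a quote from the ms-swift source.

```python
import ast


def find_method_calls(source: str, method_name: str) -> list[int]:
    """Return sorted line numbers where `<expr>.<method_name>(...)` is called."""
    tree = ast.parse(source)
    hits = set()
    for node in ast.walk(tree):
        # A method call is an ast.Call whose callee is an attribute access.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == method_name):
            hits.add(node.lineno)
    return sorted(hits)


# Hypothetical stand-in for the file contents (not the real ms-swift source).
sample = (
    "mean_length = total_lengths.mean().row()\n"
    "mean_length_ok = total_lengths.mean().item()\n"
)

row_lines = find_method_calls(sample, "row")    # lines calling .row()
item_lines = find_method_calls(sample, "item")  # lines calling .item()
print("row() called on lines:", row_lines)
print("item() called on lines:", item_lines)
```

To check a real installation, read the text of the `grpo_trainer.py` path shown in the traceback and pass it to `find_method_calls(text, "row")`. Any hit can then be compared against the line linked for the v3.9.2 tag.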

File "/home/ma-user/modelarts/user-job-dir/ms-swift-3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1354, in _prepare_batch_inputs
[rank1]:     self._metrics[mode]['completions/mean_length'].append(total_lengths.mean().row())

Did you modify the source code?

https://github.com/modelscope/ms-swift/blob/v3.9.2/swift/trainers/rlhf_trainer/grpo_trainer.py#L1354

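No. However, I pulled the code some time ago, I think from the main branch. The code at that point may have had a bug that was fixed later. I'll update the code and try again.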

singing4you avatar Nov 18 '25 10:11 singing4you