SwissArmyTransformer
“No backend type associated with device type cpu” when running cli_demo_sat.py
Traceback (most recent call last):
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
    main()
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
    model, model_args = AutoModel.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 367, in from_pretrained
    mp_split_model_receive(model, use_node_group=use_node_group)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 91, in mp_split_model_receive
    iter_repartition(model)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 84, in iter_repartition
    torch.distributed.recv(sub_module.weight.data, src)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1640, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: No backend type associated with device type cpu
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[2024-03-05 14:32:43,744] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 50878 closing signal SIGTERM
This used to run fine, but now it fails again. Has sat been updated? Current versions: torch=2.1.2, sat=0.4.11, transformers=4.38.2