SwissArmyTransformer

“No backend type associated with device type cpu” when running cli_demo_sat.py

Open yileld opened this issue 11 months ago • 5 comments

Traceback (most recent call last):
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
    main()
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
    model, model_args = AutoModel.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 367, in from_pretrained
    mp_split_model_receive(model, use_node_group=use_node_group)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 91, in mp_split_model_receive
    iter_repartition(model)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 84, in iter_repartition
    torch.distributed.recv(sub_module.weight.data, src)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1640, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: No backend type associated with device type cpu
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[2024-03-05 14:32:43,744] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 50878 closing signal SIGTERM

This used to run fine, but now it fails again. Has sat been updated? Current versions: torch=2.1.2, sat=0.4.11, transformers=4.38.2
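For context, here is a minimal sketch of what the error message likely points at. It assumes the process group was initialized with an NCCL-only backend (the `ProcessGroupNCCL` warning in the log suggests this), so `torch.distributed.recv` has no backend registered for a tensor that still lives on the CPU. The helper names `backend_spec` and `init_process_group_with_cpu_support` are hypothetical and are not part of sat. sat sets up its own process group, so this illustrates the mechanism and is not a confirmed fix for this issue.

```python
def backend_spec(use_cuda: bool) -> str:
    """Build a backend string for torch.distributed.init_process_group.

    Hypothetical helper: "cpu:gloo,cuda:nccl" registers Gloo for CPU
    tensors and NCCL for CUDA tensors, so point-to-point calls such as
    recv() have a backend for either device type.
    """
    return "cpu:gloo,cuda:nccl" if use_cuda else "gloo"


def init_process_group_with_cpu_support(use_cuda: bool) -> None:
    """Initialize the default process group with a CPU-capable backend.

    Assumes the launcher (for example torchrun) has already set RANK,
    WORLD_SIZE, MASTER_ADDR and MASTER_PORT in the environment.
    """
    # Imported lazily so backend_spec() can be used without torch installed.
    import torch.distributed as dist

    dist.init_process_group(backend=backend_spec(use_cuda))
```

The other common way around this class of error is to make sure the tensor passed to `recv` is already on the GPU when the group only has an NCCL backend. Which of the two applies here depends on how sat creates its process group.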

yileld, Mar 05 '24 06:03