Jamly7 comments


Command run: `torchrun --nproc_per_node 2 --nnodes 2 --node_rank 0 --master_addr 127.0.0.1 /root/lh/swift/swift/cli/sft.py --model_type qwen1half-7b-chat --model_id_or_path /mnt/model_repository/Qwen1.5-7B-Chat/ --dataset /root/lh/data2.jsonl --output_dir /root/lh/output/ --add_output_dir_suffix false --deepspeed default-zero3 --ddp_backend=nccl`

Log output: W0625 08:58:08.766000 140195703509632 torch/distributed/run.py:757] W0625...

Problems with multi-node multi-GPU training

Is there a command for this? I didn't see one in the tutorial.

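Problems with multi-node multi-GPU training

Does "loading the model into memory on a single GPU" mean deployment (swift deploy)? I deployed first and then launched the fine-tuning commands on the master and worker nodes, but it still hangs at model loading ([INFO:swift] Loading the model using model_dir: /mnt/model_repository/Qwen1.5-7B-Chat).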

Problems with multi-node multi-GPU training

Does multi-node multi-GPU training have any hardware or network requirements?

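Problems with multi-node multi-GPU training

After waiting 30 minutes for the timeout, I found that the socket connection had failed. It later turned out to be a possible network interface configuration issue. After adding two settings, training proceeded to the model-loading stage: export NCCL_SOCKET_IFNAME=eth0 export NCCL_IB_DISABLE=1. It is not yet clear whether other problems will follow.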
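The two settings above can be sketched as a small environment setup to run on every node before `torchrun`. The interface name `eth0` comes from the comment and will differ per machine. `NCCL_DEBUG=INFO` is my addition, not part of the original fix; it is a standard NCCL variable that makes NCCL log which interface and transport it picks.

```shell
# Sketch: NCCL environment for multi-node training over plain TCP sockets.
# Run on every node before launching torchrun.

# Network interface NCCL should use for socket traffic.
# "eth0" is taken from the comment; list yours with `ip -o -4 addr show`.
export NCCL_SOCKET_IFNAME=eth0

# Disable the InfiniBand transport so NCCL falls back to sockets.
export NCCL_IB_DISABLE=1

# Assumption (not in the original comment): verbose NCCL logging,
# useful for confirming which interface was actually selected.
export NCCL_DEBUG=INFO
```

Exporting the variables in the same shell session that launches `torchrun` matters, since child processes only inherit variables set before they start.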

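Problems with multi-node multi-GPU training

A further error appeared. After NCCL_SOCKET_IFNAME was configured and the run entered the training phase, NCCL tried to connect to socket port 36791 on the master node, but the connection failed.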
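One way to narrow down a failure like this is to probe, from a worker node, whether a TCP port on the master is reachable at all. The following is a minimal sketch, not a definitive diagnosis: `check_port` is a hypothetical helper name, it assumes `bash` and coreutils `timeout` are installed, and the host and port in the usage comment are placeholders. NCCL chooses its ports dynamically, so a probe only tells you whether something (for example a firewall) blocks traffic between the nodes in general.

```shell
# Sketch: test whether HOST:PORT accepts TCP connections.
# check_port is a hypothetical helper, not part of NCCL or swift.
check_port() {
    host="$1"
    port="$2"
    # bash's /dev/tcp pseudo-device opens a TCP connection;
    # timeout aborts the attempt after 3 seconds.
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Usage (placeholders; substitute your master address and an open port):
#   if check_port <master_ip> <port>; then echo reachable; else echo blocked; fi
```

The function returns 0 when the connection succeeds and non-zero when it is refused or times out, so it can be used directly in an `if` statement.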

Problems with multi-node multi-GPU training

NCCL WARN socketProgressOpt: Call to recv from 192.168.1.43 failed : Broken pipe. [rank2]:[E ProcessGroupNCCL.cpp:577] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of...

Add a hyperparameter tuning tutorial for layeroutlmv3

![layerout](https://github.com/user-attachments/assets/d9598a79-ae06-40ea-8811-2a673fff2852)