
Multi-node training error

Open zhangfan-algo opened this issue 9 months ago • 0 comments

Describe the bug

2024-05-16 14:19:20 [W socket.cpp:697] [c10d] The IPv6 network addresses of (zf-yi1-5-34b-sft-0516-02-master-0, 23456) cannot be retrieved (gai error: -2 - Name or service not known).
2024-05-16 14:19:35 Traceback (most recent call last):
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/study_info/swift_0516/examples/pytorch/llm/llm_sft.py", line 2, in <module>
2024-05-16 14:19:35     import custom
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/study_info/swift_0516/examples/pytorch/llm/custom.py", line 5, in <module>
2024-05-16 14:19:35     from modelscope import AutoConfig, AutoModelForCausalLM, AutoTokenizer, MsDataset
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/modelscope/__init__.py", line 4, in <module>
2024-05-16 14:19:35     from modelscope.utils.import_utils import LazyImportModule
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/modelscope/utils/__init__.py", line 1, in <module>
2024-05-16 14:19:35     from .hub import create_model_if_not_exist, read_config
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/modelscope/utils/hub.py", line 12, in <module>
2024-05-16 14:19:35     from modelscope.utils.config import Config
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/modelscope/utils/config.py", line 19, in <module>
2024-05-16 14:19:35     from yapf.yapflib.yapf_api import FormatCode
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/__init__.py", line 41, in <module>
2024-05-16 14:19:35     from yapf.yapflib import yapf_api
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/yapflib/yapf_api.py", line 38, in <module>
2024-05-16 14:19:35     from yapf.pyparser import pyparser
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/pyparser/pyparser.py", line 44, in <module>
2024-05-16 14:19:35     from yapf.yapflib import format_token
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/yapflib/format_token.py", line 23, in <module>
2024-05-16 14:19:35     from yapf.pytree import pytree_utils
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf/pytree/pytree_utils.py", line 30, in <module>
2024-05-16 14:19:35     from yapf_third_party._ylib2to3 import pygram
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf_third_party/_ylib2to3/pygram.py", line 39, in <module>
2024-05-16 14:19:35     pattern_grammar = driver.load_grammar(_PATTERN_GRAMMAR_FILE)
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf_third_party/_ylib2to3/pgen2/driver.py", line 252, in load_grammar
2024-05-16 14:19:35     g.load(gp)
2024-05-16 14:19:35   File "/mnt/pfs/zhangfan/system/test/conda/envs/swift/lib/python3.10/site-packages/yapf_third_party/_ylib2to3/pgen2/grammar.py", line 95, in load
2024-05-16 14:19:35     d = pickle.load(f)
2024-05-16 14:19:35 EOFError: Ran out of input
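
The last frame of the log, `pickle.load(f)` failing with `EOFError: Ran out of input`, is the error Python raises when the file handed to `pickle.load` contains no data at all. One possible explanation, which is an assumption and not something this report confirms, is that a cached grammar pickle on the shared filesystem was read by one process while another was still writing it. The sketch below only shows how such a file could be spotted; `find_empty_pickles` and `loads_cleanly` are hypothetical helper names, and which directory to scan is left to the reader.

```python
import pickle
import tempfile
from pathlib import Path


def find_empty_pickles(root):
    """Return the sorted paths of zero-byte *.pickle files under root."""
    return sorted(
        p for p in Path(root).rglob("*.pickle")
        if p.is_file() and p.stat().st_size == 0
    )


def loads_cleanly(path):
    """Return False if unpickling hits EOFError, True otherwise.

    Only call this on files you trust: unpickling can execute code.
    """
    try:
        with open(path, "rb") as f:
            pickle.load(f)
    except EOFError:
        return False
    return True


if __name__ == "__main__":
    # Self-contained demo on temporary files, not on a real environment.
    with tempfile.TemporaryDirectory() as tmp:
        empty = Path(tmp) / "empty.pickle"
        empty.touch()  # zero bytes, as if a writer had not finished
        good = Path(tmp) / "good.pickle"
        good.write_bytes(pickle.dumps({"ok": True}))

        print("empty pickles:", [p.name for p in find_empty_pickles(tmp)])
        print("good loads cleanly:", loads_cleanly(good))
        print("empty loads cleanly:", loads_cleanly(empty))
```

If a scan of the relevant cache or package directory turns up a zero-byte pickle, deleting it so that it is regenerated by a single process before the multi-node launch would be one thing to try. This is a suggestion under the assumption above, not a confirmed fix.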

Your hardware and system info

torchrun --nproc_per_node ${num_gpu_per_node} --master_port $MASTER_PORT --master_addr $MASTER_ADDR --node_rank $RANK --nnodes $WORLD_SIZE examples/pytorch/llm/llm_sft.py \
    --model_cache_dir /mnt/pfs/zhangfan/models/01-ai/Yi-1.5-34B-Chat \
    --model_type yi-1_5-34b-chat \
    --sft_type full \
    --tuner_backend swift \
    --template_type AUTO \
    --output_dir output/test \
    --ddp_backend nccl \
    --custom_train_dataset_path train_classfiy.jsonl \
    --dataset_test_ratio 0.03 \
    --self_cognition_sample -1 \
    --preprocess_num_proc 60 \
    --dataloader_num_workers 60 \
    --train_dataset_sample -1 \
    --dataset_test_ratio 0.01 \
    --lr_scheduler_type cosine \
    --num_train_epochs 5 \
    --save_total_limit 10 \
    --save_strategy epoch \
    --evaluation_strategy steps \
    --eval_steps 50 \
    --logging_steps 10 \
    --batch_size 1 \
    --eval_batch_size 1 \
    --max_length 17000 \
    --check_dataset_strategy warning \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 8 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --use_flash_attn true \
    --push_to_hub false \
    --deepspeed_config_path ds_z2_config.json \
    --save_only_model false \
    --save_on_each_node false \
    --lazy_tokenize true \
    --lisa_activated_layers 8 \
    --lisa_step_interval 20 \
    --neftune_noise_alpha 10 \
    --dtype AUTO

zhangfan-algo, May 16 '24 06:05