
Ascend 910B: error when running full-parameter fine-tuning with ChatGLM2

Open | belle9217 opened this issue 1 year ago · 1 comment

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

The bug is shown in the screenshot below.

Expected behavior

No response

System Info

No response

Others

[screenshot of the error]

belle9217 commented on May 17, 2024

[INFO|modeling_utils.py:4170] 2024-05-20 17:25:15,119 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:4178] 2024-05-20 17:25:15,119 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at /root/.cache/modelscope/hub/ZhipuAI/chatglm3-6b. If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:3719] 2024-05-20 17:25:15,124 >> Generation config file not found, using a generation config created from the model config.
05/20/2024 17:25:15 - INFO - llamafactory.model.utils.checkpointing - Gradient checkpointing enabled.
05/20/2024 17:25:15 - INFO - llamafactory.model.utils.attention - Using vanilla attention implementation.
05/20/2024 17:25:15 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
05/20/2024 17:25:15 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
05/20/2024 17:25:15 - INFO - llamafactory.model.loader - trainable params: 1949696 || all params: 6245533696 || trainable%: 0.0312
[INFO|trainer.py:626] 2024-05-20 17:25:15,521 >> Using auto half precision backend
[INFO|trainer.py:2048] 2024-05-20 17:25:15,881 >> ***** Running training *****
[INFO|trainer.py:2049] 2024-05-20 17:25:15,881 >> Num examples = 1,000
[INFO|trainer.py:2050] 2024-05-20 17:25:15,881 >> Num Epochs = 3
[INFO|trainer.py:2051] 2024-05-20 17:25:15,882 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2054] 2024-05-20 17:25:15,882 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2055] 2024-05-20 17:25:15,882 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2056] 2024-05-20 17:25:15,882 >> Total optimization steps = 186
[INFO|trainer.py:2057] 2024-05-20 17:25:15,884 >> Number of trainable parameters = 1,949,696
  0%|          | 0/186 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data/anaconda3/envs/llama_factory/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/data/LLaMA-Factory/src/llamafactory/cli.py", line 65, in main
    run_exp()
  File "/data/LLaMA-Factory/src/llamafactory/train/tuner.py", line 34, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3161, in compute_loss
    outputs = model(**inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/peft_model.py", line 1129, in forward
    return self.base_model(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 941, in forward
    transformer_outputs = self.transformer(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 834, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 631, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
    ret = function(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 408, in forward
    query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
NotImplementedError: Unknown device for graph fuser

hunterhome commented on May 20, 2024

@hunterhome ChatGLM uses torch.jit, which torch-npu does not support. You can comment out the corresponding torch.jit decorators.
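The advice above can also be applied without losing TorchScript on GPU machines: instead of deleting the decorator lines in modeling_chatglm.py, replace them with a conditional no-op. This is a minimal sketch of that idea; `maybe_jit_script`, `identity_decorator`, and `double` are illustrative names, not identifiers from the ChatGLM repo.

```python
def identity_decorator(fn):
    # No-op stand-in for torch.jit.script on backends without a graph fuser.
    return fn

try:
    import torch
    # torch-npu raises "Unknown device for graph fuser" inside TorchScript-ed
    # functions, so only script when a CUDA device is actually available.
    maybe_jit_script = (
        torch.jit.script if torch.cuda.is_available() else identity_decorator
    )
except ImportError:  # keep the sketch importable even without torch installed
    maybe_jit_script = identity_decorator

@maybe_jit_script
def double(x: int) -> int:
    # Stand-in for the rotary-embedding helpers that modeling_chatglm.py
    # decorates with @torch.jit.script.
    return x * 2
```

With this pattern, the same file keeps working on CUDA while falling back to eager execution on NPU.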

ji-huazhong commented on May 26, 2024

[screenshot]

ji-huazhong commented on May 27, 2024

cc @belle9217

ji-huazhong commented on May 27, 2024

One additional note: do not comment out the torch.jit decorators in the file shown in the traceback, ~/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py. Instead, modify the corresponding file in the repo downloaded from ModelScope or Hugging Face; for ModelScope, for example, that is ~/.cache/modelscope/hub/ZhipuAI/chatglm3-6b/modeling_chatglm.py.
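A hedged shell sketch of the edit described above, demonstrated on a throwaway demo file rather than the real cache path (substitute the ModelScope or Hugging Face path from the comment above for `modeling_chatglm_demo.py`):

```shell
# Demo file standing in for modeling_chatglm.py in the downloaded repo.
f=modeling_chatglm_demo.py
printf '@torch.jit.script\ndef f(x):\n    return x\n' > "$f"

# Comment out every line that is exactly an @torch.jit.script decorator;
# "&" in the replacement re-inserts the matched line after the "# ".
sed -i 's/^@torch\.jit\.script/# &/' "$f"

# Verify: exactly one decorator line is now commented out.
grep -c '^# @torch.jit.script' "$f"
```

`grep -n "torch.jit" <path>` is a quick way to find all the decorator lines before editing. Note that GNU sed's `-i` is assumed here; BSD/macOS sed needs `-i ''`.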

MengqingCao commented on May 28, 2024

Thanks! Training now runs, but at about 25% it fails with the following error:

Traceback (most recent call last):
  File "/data/LLaMA-Factory/src/llamafactory/launcher.py", line 9, in <module>
    launch()
  File "/data/LLaMA-Factory/src/llamafactory/launcher.py", line 5, in launch
    run_exp()
  File "/data/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/peft_model.py", line 1129, in forward
    return self.base_model(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 941, in forward
    transformer_outputs = self.transformer(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 834, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 631, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
    ret = function(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 408, in forward
    query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 169, in apply_rotary_pos_emb
    rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
RuntimeError: shape '[13024, -1, 1, 32, 2]' is invalid for input of size 524288

[2024-05-31 13:24:51,107] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345687 closing signal SIGTERM
[2024-05-31 13:24:51,107] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345688 closing signal SIGTERM
[2024-05-31 13:24:51,108] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345690 closing signal SIGTERM
[2024-05-31 13:24:51,110] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345691 closing signal SIGTERM
[2024-05-31 13:24:51,114] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345692 closing signal SIGTERM
[2024-05-31 13:24:51,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345693 closing signal SIGTERM
[2024-05-31 13:24:51,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345694 closing signal SIGTERM
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run
    key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT)
  File "<string>", line 2, in get
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod
    kind, result = conn.recv()
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
[the identical Thread-2 traceback and EOFError repeats for each of the remaining worker processes]
[2024-05-31 13:24:58,118] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
[the line above repeats once per spawned process, with timestamps from 13:24:58 to 13:24:59]
[2024-05-31 13:25:21,117] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345687 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:21,409] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345688 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:21,743] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345690 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:22,067] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345691 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:22,353] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345692 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:22,754] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345693 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:23,183] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345694 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:23,720] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 1345689) of binary: /data/anaconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
  File "/data/anaconda3/envs/llama_factory/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/data/LLaMA-Factory/src/llamafactory/launcher.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2024-05-31_13:24:51
  host      : localhost.localdomain
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1345689)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

------------------ Original message ------------------
From: "hiyouga/LLaMA-Factory" @.>;
Sent: Tuesday, May 28, 2024, 1:39 PM
To: @.>;
Cc: @.@.>;
Subject: Re: [hiyouga/LLaMA-Factory] Ascend 910B: error when running full-parameter fine-tuning with ChatGLM2 (Issue #3788)

Closed #3788 as completed.


hunterhome commented on May 31, 2024

> Thanks! Training now runs, but at about 25% it fails with the following error: Traceback (most recent call last): File "/data/LLaMA-Factory/src/llam

How was this error resolved in the end?

wphtrying commented on Jul 23, 2024