ascend 910b: full-parameter fine-tuning of chatglm2 reports an error
Reminder
- [X] I have read the README and searched the existing issues.
Reproduction
The bug is shown in the screenshot below.
Expected behavior
No response
System Info
No response
Others
[INFO|modeling_utils.py:4170] 2024-05-20 17:25:15,119 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:4178] 2024-05-20 17:25:15,119 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at /root/.cache/modelscope/hub/ZhipuAI/chatglm3-6b.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|modeling_utils.py:3719] 2024-05-20 17:25:15,124 >> Generation config file not found, using a generation config created from the model config.
05/20/2024 17:25:15 - INFO - llamafactory.model.utils.checkpointing - Gradient checkpointing enabled.
05/20/2024 17:25:15 - INFO - llamafactory.model.utils.attention - Using vanilla attention implementation.
05/20/2024 17:25:15 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
05/20/2024 17:25:15 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
05/20/2024 17:25:15 - INFO - llamafactory.model.loader - trainable params: 1949696 || all params: 6245533696 || trainable%: 0.0312
[INFO|trainer.py:626] 2024-05-20 17:25:15,521 >> Using auto half precision backend
[INFO|trainer.py:2048] 2024-05-20 17:25:15,881 >> ***** Running training *****
[INFO|trainer.py:2049] 2024-05-20 17:25:15,881 >> Num examples = 1,000
[INFO|trainer.py:2050] 2024-05-20 17:25:15,881 >> Num Epochs = 3
[INFO|trainer.py:2051] 2024-05-20 17:25:15,882 >> Instantaneous batch size per device = 2
[INFO|trainer.py:2054] 2024-05-20 17:25:15,882 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:2055] 2024-05-20 17:25:15,882 >> Gradient Accumulation steps = 8
[INFO|trainer.py:2056] 2024-05-20 17:25:15,882 >> Total optimization steps = 186
[INFO|trainer.py:2057] 2024-05-20 17:25:15,884 >> Number of trainable parameters = 1,949,696
0%| | 0/186 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/data/anaconda3/envs/llama_factory/bin/llamafactory-cli", line 8, in
@hunterhome ChatGLM uses torch.jit, which torch-npu does not support; you can comment out the corresponding torch.jit decorators.
cc @belle9217
One additional note: do not comment out the torch.jit decorators in the file shown in the traceback, ~/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py; instead, edit the corresponding file in the repo downloaded from modelscope or huggingface (presumably because the transformers_modules copy is regenerated from the downloaded repo, so edits there get overwritten). For modelscope that is ~/.cache/modelscope/hub/ZhipuAI/chatglm3-6b/modeling_chatglm.py.
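To illustrate the workaround described above, here is a minimal sketch (my own, not from the thread) that comments out any `@torch.jit`-decorated line while preserving indentation. The cache path mentioned in the comment is the one from the previous post; the `sample` string is a stand-in for the real file contents:

```python
def comment_out_jit_decorators(text: str) -> str:
    """Prefix every line that starts with an @torch.jit decorator with '# ',
    preserving indentation, so torch-npu never sees the JIT-scripted version."""
    out = []
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip()
        if stripped.startswith("@torch.jit"):
            indent = line[: len(line) - len(stripped)]
            out.append(indent + "# " + stripped)
        else:
            out.append(line)
    return "".join(out)

# Stand-in snippet; in practice read and rewrite
# ~/.cache/modelscope/hub/ZhipuAI/chatglm3-6b/modeling_chatglm.py instead.
sample = "@torch.jit.script\ndef apply_rotary_pos_emb(x, rope_cache):\n    return x\n"
print(comment_out_jit_decorators(sample))
```

Applied to the downloaded modeling_chatglm.py, this neutralizes the decorators without touching the function bodies, which matches the advice above.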
Thanks! Now at about 25% of training, the following error occurs:

Traceback (most recent call last):
  File "/data/LLaMA-Factory/src/llamafactory/launcher.py", line 9, in <module>
    launch()
  File "/data/LLaMA-Factory/src/llamafactory/launcher.py", line 5, in launch
    run_exp()
  File "/data/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
    return model_forward(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/peft_model.py", line 1129, in forward
    return self.base_model(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 941, in forward
    transformer_outputs = self.transformer(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 834, in forward
    hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 631, in forward
    layer_ret = torch.utils.checkpoint.checkpoint(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 489, in checkpoint
    ret = function(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 544, in forward
    attention_output, kv_cache = self.self_attention(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 408, in forward
    query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
  File "/root/.cache/huggingface/modules/transformers_modules/chatglm3-6b/modeling_chatglm.py", line 169, in apply_rotary_pos_emb
    rope_cache = rope_cache.view(sq, -1, 1, xshaped.size(3), 2)
RuntimeError: shape '[13024, -1, 1, 32, 2]' is invalid for input of size 524288

[2024-05-31 13:24:51,107] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345687 closing signal SIGTERM
[2024-05-31 13:24:51,107] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345688 closing signal SIGTERM
[2024-05-31 13:24:51,108] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345690 closing signal SIGTERM
[2024-05-31 13:24:51,110] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345691 closing signal SIGTERM
[2024-05-31 13:24:51,114] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345692 closing signal SIGTERM
[2024-05-31 13:24:51,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345693 closing signal SIGTERM
[2024-05-31 13:24:51,116] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1345694 closing signal SIGTERM

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/repository_manager/utils/multiprocess_util.py", line 91, in run
    key, func, args, kwargs = self.task_q.get(timeout=TIMEOUT)
  File "<string>", line 2, in get
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/managers.py", line 818, in _callmethod
    kind, result = conn.recv()
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
(the identical "Exception in thread Thread-2" EOFError traceback repeats several more times, once per remaining worker)

[2024-05-31 13:24:58,118] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to npu (auto detect)
(this line repeats once per spawned process, with timestamps from 13:24:58 to 13:24:59)

[2024-05-31 13:25:21,117] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345687 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:21,409] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345688 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:21,743] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345690 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:22,067] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345691 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:22,353] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345692 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:22,754] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345693 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:23,183] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 1345694 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
[2024-05-31 13:25:23,720] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 1345689) of binary: /data/anaconda3/envs/llama_factory/bin/python
Traceback (most recent call last):
  File "/data/anaconda3/envs/llama_factory/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/data/LLaMA-Factory/src/llamafactory/launcher.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2024-05-31_13:24:51
  host       : localhost.localdomain
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 1345689)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
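For what it's worth, here is my reading of the RuntimeError above (an inference from the numbers in the message, not a confirmed diagnosis): rope_cache holds 524288 elements, which at 32 × 2 values per position covers exactly 8192 positions, while the view() is being asked for sq = 13024 positions, so the packed sequence length exceeds what the rotary-embedding cache was built for:

```python
# Constants read directly off the error:
#   shape '[13024, -1, 1, 32, 2]' is invalid for input of size 524288
sq = 13024                # sequence length requested by view()
total_elems = 524288      # elements actually present in rope_cache
per_position = 32 * 2     # trailing dims of the target shape

cache_positions = total_elems // per_position
print(cache_positions)                     # 8192: positions the cache covers
print(total_elems % (sq * per_position))   # nonzero remainder, so view() must fail
```

If that reading is right, keeping the effective sequence length (e.g. cutoff_len) at or below the cache's 8192 positions should avoid the crash, but the thread does not confirm this.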
Closed #3788 as completed.
> Thanks! Now at about 25% of training, the following error occurs: Traceback (most recent call last): File "/data/LLaMA-Factory/src/llam

How was this error resolved in the end?