GLM-130B
Distributed training error: advice from anyone who has gotten this running would be much appreciated.
The error output is as follows:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
WARNING: No training data specified
WARNING: No training data specified
WARNING: No training data specified
WARNING: No training data specified
using world size: 4 and model-parallel size: 4
> padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
> initializing model parallel with size 4
> Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
global rank 3 is loading checkpoint ./model_tp/49300/mp_rank_03_model_states.pt
global rank 2 is loading checkpoint ./model_tp/49300/mp_rank_02_model_states.pt
global rank 0 is loading checkpoint ./model_tp/49300/mp_rank_00_model_states.pt
global rank 1 is loading checkpoint ./model_tp/49300/mp_rank_01_model_states.pt
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52478 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52480 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52481 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 52479) of binary: /home/lu/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
File "/home/lu/anaconda3/envs/glm130b/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
run(args)
File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
elastic_launch(
File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
/home2/lu/GLM-130B/generate.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-04-05_17:50:19
host : guest-server
rank : 1 (local_rank: 1)
exitcode : -9 (pid: 52479)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 52479
======================================================
I'm running on 4x V100 (32 GB). Environment: torch 1.12.1 / cuda 11.3 / transformers 4.27.1
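For what it's worth, exitcode -9 means the rank received SIGKILL, which on Linux is most often the kernel OOM killer: per the log above, all four ranks torch.load their mp_rank_0X_model_states.pt shards into host RAM at the same time. You can check the kernel log (dmesg) for an "Out of memory: Killed process" entry to confirm. A minimal pre-flight check along these lines can also estimate whether the box has enough free RAM before launching; this is a sketch, and psutil plus summing the shard file sizes are my own assumptions, not part of the repo:

```python
# Hypothetical pre-flight check: estimate whether host RAM can hold
# all tensor-parallel shards being torch.load()-ed simultaneously.
import os
import psutil

CHECKPOINT_DIR = "./model_tp/49300"   # path taken from the log above
SHARD_SUFFIX = "_model_states.pt"     # one shard per model-parallel rank

shard_bytes = sum(
    os.path.getsize(os.path.join(CHECKPOINT_DIR, f))
    for f in os.listdir(CHECKPOINT_DIR)
    if f.endswith(SHARD_SUFFIX)
)
available = psutil.virtual_memory().available

print(f"shards on disk : {shard_bytes / 2**30:.1f} GiB")
print(f"RAM available  : {available / 2**30:.1f} GiB")
if shard_bytes > available:
    print("Loading all shards at once will likely trigger the OOM killer.")
```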
Hi, which script are you using for training? I haven't found where the training code lives yet.
+1
+1. When I run it, the machine's RAM fills up completely.
Has anyone solved this? I'm stuck at the same point; at this stage it looks like there simply isn't enough host memory.
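If host RAM really is the bottleneck, one common mitigation is to serialize the per-rank loads so only one rank holds a CPU copy of its shard at a time, moving tensors to the GPU before the next rank starts. The sketch below is my own assumption, not code from the GLM-130B repo (whose checkpoint loading is handled by its own loader), so treat it as an illustration of the idea:

```python
# Hypothetical workaround (not from the GLM-130B repo): serialize the
# per-rank torch.load calls so at most one shard sits in host RAM at a
# time. Assumes torch.distributed is already initialized by the launcher.
import torch
import torch.distributed as dist

def _to_device(obj, device):
    """Recursively move any tensors in the checkpoint dict to `device`."""
    if torch.is_tensor(obj):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: _to_device(v, device) for k, v in obj.items()}
    return obj

def load_shard_staggered(path, device):
    shard = None
    for turn in range(dist.get_world_size()):
        if dist.get_rank() == turn:
            cpu_state = torch.load(path, map_location="cpu")
            shard = _to_device(cpu_state, device)  # GPU copy replaces CPU copy
            del cpu_state                          # release the host-RAM copy
        dist.barrier()  # everyone waits for the current rank to finish
    return shard
```

Adding swap space is a cruder fallback if the shards still don't fit; loading will be slow but at least won't be killed.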
Also asking: where can the training code be found?
Did you ever solve this?