
Distributed training error, seeking advice from anyone who has gotten it to run

[Open] xiaoweiweixiao opened this issue 3 years ago • 6 comments

The error output is as follows:

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
WARNING: No training data specified
WARNING: No training data specified
WARNING: No training data specified
WARNING: No training data specified
using world size: 4 and model-parallel size: 4 
> padded vocab (size: 150528) with 0 dummy tokens (new size: 150528)
> initializing model parallel with size 4
> Set tokenizer as a icetk-glm-130B tokenizer! Now you can get_tokenizer() everywhere.
global rank 3 is loading checkpoint ./model_tp/49300/mp_rank_03_model_states.pt
global rank 2 is loading checkpoint ./model_tp/49300/mp_rank_02_model_states.pt
global rank 0 is loading checkpoint ./model_tp/49300/mp_rank_00_model_states.pt
global rank 1 is loading checkpoint ./model_tp/49300/mp_rank_01_model_states.pt

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52478 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52480 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 52481 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 52479) of binary: /home/lu/anaconda3/envs/glm130b/bin/python
Traceback (most recent call last):
  File "/home/lu/anaconda3/envs/glm130b/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.12.1', 'console_scripts', 'torchrun')())
  File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lu/anaconda3/envs/glm130b/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
/home2/lu/GLM-130B/generate.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-05_17:50:19
  host      : guest-server
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 52479)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 52479
======================================================

I am training on 4x V100 (32 GB) GPUs. Environment: torch 1.12.1 / cuda 11.3 / transformers 4.27.1

xiaoweiweixiao · Apr 06 '23 01:04
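Editor's note: exitcode -9 means the worker received SIGKILL, which on Linux is most often the kernel OOM killer reclaiming host RAM; this is consistent with the later reports in this thread that memory fills up. Each mp_rank_XX_model_states.pt shard is typically deserialized into CPU memory before its tensors move to the GPU, so four ranks loading large shards at once can exhaust host RAM even when each GPU has room. Below is a minimal diagnostic sketch, assuming psutil is installed; watch_ram and its threshold are illustrative names, not part of GLM-130B:

```python
# Diagnostic sketch (hypothetical helper, not part of GLM-130B):
# poll host RAM while generate.py starts, to confirm the -9/SIGKILL
# comes from running out of memory rather than a CUDA error.
import time

import psutil  # assumed installed: pip install psutil

def watch_ram(interval_s: float = 1.0, warn_frac: float = 0.90) -> None:
    """Print used/available host RAM every interval_s seconds."""
    while True:
        mem = psutil.virtual_memory()
        print(f"RAM used: {mem.percent:.1f}% "
              f"({mem.available / 2**30:.1f} GiB available)")
        if mem.percent / 100.0 > warn_frac:
            print("WARNING: host RAM nearly exhausted")
        time.sleep(interval_s)

if __name__ == "__main__":
    watch_ram()
```

If RAM climbs toward 100% right before the SIGTERM/SIGKILL lines above appear, the fix is more host memory (or swap), not a code change; after a crash, `dmesg | grep -i kill` usually shows the OOM killer's record of the terminated PID.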

Hi, may I ask which script you used for training? I still haven't found where the training code is.

Dagoli · Apr 14 '23 05:04

+1

SnakeHacker · May 11 '23 09:05

+1. I saw the machine's RAM fill up while it was running.

kaixinjiuhao123 · May 19 '23 08:05

Has anyone solved this? I'm stuck at the same point; at this stage it's clear there isn't enough memory.

kaixinjiuhao123 · May 19 '23 08:05
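One workaround sometimes used when each shard fits in RAM individually but not all four at once is to stagger the checkpoint loads across ranks, trading startup time for lower peak memory. The sketch below is hypothetical and is not the repository's actual loader: it assumes torchrun has set LOCAL_RANK, that the process group is already initialized, and that each rank passes in its own shard_path.

```python
# Hypothetical staggered loader (not GLM-130B's real loading code):
# ranks deserialize one at a time, so peak host RAM is roughly one
# shard instead of world_size shards.
import os

import torch
import torch.distributed as dist

def load_shard_staggered(shard_path: str) -> dict:
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    world_size = dist.get_world_size()           # process group must be up
    state = None
    for turn in range(world_size):
        if local_rank == turn:
            cpu_state = torch.load(shard_path, map_location="cpu")
            # Move top-level tensors to this rank's GPU, then drop the
            # CPU copy so the next rank's load does not stack on top of
            # it in RAM. (Nested state would need a recursive move.)
            state = {k: (v.cuda(local_rank) if torch.is_tensor(v) else v)
                     for k, v in cpu_state.items()}
            del cpu_state
        dist.barrier()  # everyone waits for the current rank to finish
    return state
```

Cruder alternatives are adding swap space or moving to a machine with more RAM; either way, the symptom reported in this thread points at host memory, not the 32 GB of GPU memory.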

Also asking where to find the training code.

kaixinjiuhao123 · May 22 '23 09:05

Was this ever solved?

jweihe · Mar 22 '24 11:03