
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9)

Open hopto-dot opened this issue 1 year ago • 21 comments

I'm trying to run the 7B model on an RTX 3090 (24GB) on WSL Ubuntu, but I'm getting the following error:

jawgboi@DESKTOP-SLIQCDH:~/git/llama$ torchrun --nproc_per_node 1 example.py --ckpt_dir "./model/7B" --tokenizer_path "./model/tokenizer.model"
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 25586) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
example.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-03_23:20:30
  host      : DESKTOP-SLIQCDH.
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 25586)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 25586
======================================================

I have tried:

  1. Changing `torch.distributed.init_process_group("nccl")` to `torch.distributed.init_process_group("gloo")`
  2. Adding `.cuda().half()` to the end of `model = Transformer(model_args)`
  3. Changing the `32` in `max_batch_size: int = 32,` to `8`

hopto-dot avatar Mar 03 '23 23:03 hopto-dot

Did you enable CUDA in WSL? https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl
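A quick sanity check, assuming the Windows NVIDIA driver and WSL CUDA support are installed, is to confirm the GPU is actually visible from inside the WSL shell:

    # should list the RTX 3090 if GPU passthrough works
    nvidia-smi
    # should print True if PyTorch can see CUDA
    python3 -c "import torch; print(torch.cuda.is_available())"

If both of these look fine, CUDA itself is probably not the problem.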

neuhaus avatar Mar 04 '23 12:03 neuhaus

Have you tried to set CUDA_VISIBLE_DEVICES=0?
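In case it helps, that would look something like this (paths taken from the original command):

    CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node 1 example.py --ckpt_dir "./model/7B" --tokenizer_path "./model/tokenizer.model"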

dmitry avatar Mar 04 '23 17:03 dmitry

What memory limit do you have in .wslconfig? I believe you need to make this value big enough so torch can load the weights before moving them to the GPU. Also try specifying the device you will be using during inference. Example: torchrun --nproc_per_node 1 example.py --ckpt_dir ./7B --tokenizer_path ./tokenizer.model --device 0
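For reference, a minimal .wslconfig along these lines (it lives at %UserProfile%\.wslconfig on the Windows side; the exact numbers are just a starting point) could look like:

    [wsl2]
    # enough RAM to hold the 7B checkpoint while it is loaded on the CPU side
    memory=16GB
    processors=8

After editing it, run wsl --shutdown from a Windows prompt so the new limits take effect the next time WSL starts.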

mtb0x1 avatar Mar 05 '23 15:03 mtb0x1

What memory limit do you have in .wslconfig? I believe you need to make this value big enough so torch can load the weights before moving them to the GPU. Also try specifying the device you will be using during inference. Example: torchrun --nproc_per_node 1 example.py --ckpt_dir ./7B --tokenizer_path ./tokenizer.model --device 0

Yeah, same issue. Increase the RAM; it was not enough to first load the model into RAM and then move it to the GPU.

gamingflexer avatar Mar 06 '23 10:03 gamingflexer

Same error. I modified max_batch_size to 1 and still get this error.

kli017 avatar Mar 07 '23 06:03 kli017

Found a solution yet?

capripio avatar Mar 10 '23 18:03 capripio

Use this notebook, I have tried it:

https://colab.research.google.com/drive/1ESttkeO8Ww2--8dlNLIGuEoG4qhP_r96?usp=sharing

gamingflexer avatar Mar 11 '23 02:03 gamingflexer

Same issue here! Any solution?

Update: there is no issue for me anymore after using 2x A6000 and 100GB of memory for the 7B and 13B models with MP=2.

aliaraabi avatar Mar 14 '23 18:03 aliaraabi

Same issue here as well, thanks for the help.

crypto-maniac avatar Mar 24 '23 12:03 crypto-maniac

Same issue, does anyone have a solution?

raytions avatar Mar 31 '23 14:03 raytions

I am running into a similar issue using an A100 GPU.

SujoyDutta avatar Mar 31 '23 19:03 SujoyDutta

I'm also having this issue.

nanghtet avatar Apr 04 '23 04:04 nanghtet

I had the same issue when using 4x T4 in a Docker container. I could solve the problem by increasing the shared memory size from 64MB (the default) to 8GB. Example: docker run --shm-size=8gb. Or if you use docker-compose.yaml:

services:
  your_service:
    shm_size: '8gb'
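A fuller docker run invocation along those lines (the image name is just a placeholder) might be:

    # expose the GPUs and raise the shared-memory limit from the 64MB default
    docker run --gpus all --shm-size=8gb -it your_image bash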

Tyaba avatar Apr 21 '23 08:04 Tyaba

Have you tried modifying the .wslconfig file to allow more memory and more processors? It works for me.

BoxiangW avatar Apr 26 '23 06:04 BoxiangW

What memory limit do you have in .wslconfig? I believe you need to make this value big enough so torch can load the weights before moving them to the GPU. Also try specifying the device you will be using during inference. Example: torchrun --nproc_per_node 1 example.py --ckpt_dir ./7B --tokenizer_path ./tokenizer.model --device 0

Yeah, same issue. Increase the RAM; it was not enough to first load the model into RAM and then move it to the GPU.

Can I ask what the smallest amount of RAM required is? I've tried 12GB but still no luck.

jzhu382 avatar May 01 '23 03:05 jzhu382

In Colab, choosing GPU as the runtime type and High-RAM as the runtime shape will solve this problem.

frankchieng avatar Jul 23 '23 15:07 frankchieng

I had the same issue, and I solved it by increasing .wslconfig to memory=16GB and processors=8 (I think it can be reduced). This message appears: Loaded in 87.83 seconds

pierrebelin avatar Aug 11 '23 06:08 pierrebelin

In Colab, choosing GPU as the runtime type and High-RAM as the runtime shape will solve this problem.

Can you explain more clearly?

threeneedone avatar Sep 27 '23 07:09 threeneedone

I'm trying to run the 7B model on an RTX 3090 (24GB) on WSL Ubuntu, but I'm getting the following error: [...]

Did you check htop (or the like) to see whether you simply run out of memory?
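Exit code -9 means the process received SIGKILL, which on Linux is usually the kernel's out-of-memory killer. Two commands that may help confirm that inside WSL:

    # how much RAM and swap is free before loading the model
    free -h
    # look for an OOM-killer entry after the crash
    sudo dmesg | grep -i "killed process"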

feinmann avatar Oct 08 '23 06:10 feinmann

Same error here. Has anyone fixed it?

xiaxin1998 avatar Mar 19 '24 01:03 xiaxin1998