ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9)
I'm trying to run the 7B model on an RTX 3090 (24 GB) on WSL Ubuntu, but I'm getting the following error:
jawgboi@DESKTOP-SLIQCDH:~/git/llama$ torchrun --nproc_per_node 1 example.py --ckpt_dir "./model/7B" --tokenizer_path "./model/tokenizer.model"
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loading
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 25586) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
example.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-03_23:20:30
host : DESKTOP-SLIQCDH.
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 25586)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 25586
======================================================
I have tried:
- Changing `torch.distributed.init_process_group("nccl")` to `torch.distributed.init_process_group("gloo")`
- Adding `.cuda().half()` to the end of `model = Transformer(model_args)`
- Changing the 32 in `max_batch_size: int = 32,` to `8`
Did you enable CUDA in WSL? https://learn.microsoft.com/en-us/windows/ai/directml/gpu-cuda-in-wsl
Have you tried setting CUDA_VISIBLE_DEVICES=0?
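For example, the variable can be set inline for a single run (paths as in the original post):

```
# Restrict PyTorch to the first GPU for this run
CUDA_VISIBLE_DEVICES=0 torchrun --nproc_per_node 1 example.py \
    --ckpt_dir "./model/7B" --tokenizer_path "./model/tokenizer.model"
```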
What memory limit do you have in .wslconfig? I believe you need to make this value big enough so torch can load things in RAM before moving them to the GPU. Also try to specify the device you will be using during inference.
Example: torchrun --nproc_per_node 1 example.py --ckpt_dir ./7B --tokenizer_path ./tokenizer.model --device 0
Yeah, same issue. Increase the RAM; it was not enough to first load the model in RAM and then move it to the GPU.
Same error. I modified max_batch_size to 1 and still get this error.
Found a solution yet?
Use this notebook; I have tried it:
https://colab.research.google.com/drive/1ESttkeO8Ww2--8dlNLIGuEoG4qhP_r96?usp=sharing
Same issue here! Any solution?
Update: there is no issue for me anymore after switching to 2x A6000 GPUs and 100 GB of memory for the 7B and 13B models with MP=2.
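For anyone else trying MP=2: the process count passed to torchrun has to match the MP value. A sketch, with paths following the original post's layout:

```
# MP=2 means two model-parallel shards, so launch two processes (one per GPU)
torchrun --nproc_per_node 2 example.py \
    --ckpt_dir "./model/13B" --tokenizer_path "./model/tokenizer.model"
```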
Same issue here as well, thanks for the help.
Same issue; does anyone have a solution?
I am running into a similar issue using an A100 GPU.
I'm also having this issue.
I had the same issue when using 4x T4 GPUs in a Docker container.
I could solve the problem by increasing the shared memory size from 64 MB (the default) to 8 GB.
Example: docker run --shm-size=8gb
Or, if you use docker-compose.yaml:
services:
  your_service:
    shm_size: '8gb'
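Putting that together with GPU access, a hedged example invocation (the image name and mount path below are placeholders, not from this thread; the relevant part is --shm-size):

```
# --gpus all requires the NVIDIA Container Toolkit on the host.
# Image and mount path are placeholders; the flag that matters here is --shm-size.
docker run --gpus all --shm-size=8gb \
    -v /path/to/llama:/workspace -w /workspace \
    -it pytorch/pytorch:latest bash
```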
Have you tried modifying the .wslconfig file to allow more memory and more processors? It works for me.
Can I ask what the minimum RAM required is? I've tried 12 GB but still no luck.
In Colab, choose GPU as the runtime type and High-RAM as the runtime shape; that will solve this problem.
I had the same issue, and I solved it by increasing .wslconfig to memory=16GB and processors=8 (I think these values can be reduced).
Now this message appears: Loaded in 87.83 seconds
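For anyone who has not edited it before, .wslconfig lives in your Windows user profile (e.g. C:\Users\<you>\.wslconfig) and only takes effect after a wsl --shutdown. A minimal version with the values mentioned above:

```
[wsl2]
# Give the WSL VM enough RAM to hold the checkpoint while torch loads it,
# before the weights are moved to the GPU
memory=16GB
processors=8
```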
In Colab, choose GPU for the runtime and the high-RAM runtime shape; that will solve this problem.
Can you explain that more clearly?
Did you check htop (or the like) to see whether you are simply running out of memory?
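Exit code -9 is a SIGKILL, which on Linux usually means the out-of-memory killer stepped in. Besides watching htop, these standard commands (not specific to this repo) can confirm it:

```
# Watch free memory in a second terminal while the model loads
watch -n 1 free -h

# After the crash, look for the OOM killer in the kernel log
sudo dmesg | grep -i -E "out of memory|killed process"
```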
Same error here. Has anyone fixed it?