fsdp_qlora
ProcessExitedException: process 0 (2x 4090)
I'm trying what looks like the "Hello World" of this repo: running the basic training on a Runpod community cloud 2 x RTX 4090 (128 vCPU, 125 GB RAM) configuration. Normally I'd play around with this for longer before posting an issue, but since Runpod was mentioned explicitly in the Answer.ai intro post, I figure this will be the simplest path for anybody trying to test this out.
On their runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 pod:
python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 2 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb
It downloads the Llama-2 model, sets everything up, and dies with the following backtrace:
Traceback (most recent call last):
File "/root/fsdp_qlora/train.py", line 939, in <module>
def main(
File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 125, in call_parse
return _f()
File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 119, in _f
return tfunc(**merge(args, args_from_prog(func, xtra)))
File "/root/fsdp_qlora/train.py", line 1010, in main
mp.spawn(fsdp_main,
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Log:
Creating model 0
Loading model 0
Model created 0 1.119 GB
trainable params: 744,488,960 || all params: 35,495,616,512 || trainable%: 2.097410985236193
Wrapping model w/ FSDP 0
Wrapped model 0 1.444 GB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000: 0%| | 0/12940 [00:00<?, ?it/s]
Here's the W&B run.
I haven't found any indicators as to what's going on. Both system and GPU RAM seem well within bounds, so I'm not sure why it's dying (unless maybe 125 GB of system RAM is not enough and is being blown through instantaneously, before it's visible on nvitop or the W&B log?).
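Since the process dies with SIGKILL, one likely culprit is the Linux OOM killer, which can fire faster than nvitop or W&B refresh. A minimal sketch for checking this, assuming the kernel log is readable inside the pod (on some containers you may need the host's logs):
# Look for OOM-killer entries around the time of the crash
dmesg -T | grep -iE 'out of memory|oom-killer|killed process' | tail -n 20
# Watch system RAM once per second while the model loads, to catch a short-lived spike
watch -n 1 free -h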
I suppose you are using SSH to log in and run the training. The likely issue is that when your SSH session disconnects, it terminates the processes you started. The way to keep it running even after you disconnect is to use something like tmux.
I ran into a similar issue, and here is what I did:
tmux new -s session_name
# run all the commands
When you SSH in again and need to attach to the same session, you can do the following:
# Confirm that your session is still alive
tmux ls
tmux attach -t session_name
When you want to detach from a tmux session, use Ctrl+b followed by d.
P.S.: Something like nohup with & at the end of the command doesn't work (in my experience), as torch has its own termination handler.
I appreciate the thoughtful response, but in this case I don't think that's what's happening. First, I'm remaining connected with a live session (remember, this is crashing before it even runs one training step). I'm also running this inside of a screen session, which has the same basic behavior as tmux.
Hopefully I'll have a chance to work more on debugging this today.
@zabirauf
I'm also facing the same issue.
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Also, I run my script using the screen command.
Thanks.
I think it would be helpful to figure out how much system RAM is needed for something like Llama 70B. The basic 2x4090 Runpod gives you 125 GB. When I tried the same command on a 4x4090 with 251 GB RAM, the training started without a hitch. I don't think this is due to the extra 2 GPUs, because each GPU is only at about 40% memory usage.
I might try another test using the same 251 GB RAM machine, but using only 2 of the GPUs, to confirm this.
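For reference, one way to run that test is to hide two of the four GPUs from PyTorch via the standard CUDA_VISIBLE_DEVICES environment variable. This is a sketch under the assumption that train.py simply uses every visible GPU; I haven't verified whether it needs an extra flag for the reduced device count.
# Expose only the first two GPUs so the run mimics a 2x4090 pod while keeping 251 GB of system RAM
CUDA_VISIBLE_DEVICES=0,1 python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 2 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--reentrant_checkpointing true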
I am trying to run this project on 2x4090 and found that the same issue occurred.
My error message was torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL
When I changed my RAM from 64 GB to 200 GB, the repository's train.py worked fine. I think it needs much more RAM than described in README.md, maybe between 128 GB and 200 GB.
@SangJunni I have 190 GB RAM but still the same issue.
Thanks.
@sanipanwala Can you allocate more than 200GB RAM then test to run train.py? I am working this project with private server using SSH connection.
@SangJunni I also used a private server and I think 190 GB is enough to run this program.
Thanks.
Running the example with a 70B model on 2 GPUs currently uses just over 128GB RAM, so systems with less than that will have issues. I've updated the readme to add a note on this and one potential fix: on my machine (128GB CPU RAM + 2x3090) I create a 10GB swapfile and this is enough to handle the RAM spike during loading before the training starts and it settles into a lower steady-state usage.
That makes perfect sense, thank you. Maybe I'll put that into a gist for starting this up on Runpod.
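For anyone setting this up, a minimal sketch of the swapfile workaround described above, assuming a Linux machine with root access and spare disk (on many container pods swapon is not permitted, so this applies mainly to bare-metal or privileged setups); the 10G size and /swapfile path are illustrative:
# Create and enable a 10 GB swapfile to absorb the RAM spike during model loading
sudo fallocate -l 10G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Verify swap is active and check overall memory
swapon --show
free -h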
@johnowhitaker ,
I'm using the below command.
python train.py --model_name codellama/CodeLlama-70b-hf --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset sql --reentrant_checkpointing true
Do you know why the loss is always nan? I'm using the "knowrohit07/know_sql" dataset.
Thanks.
@sanipanwala how soon does the issue occur? I have that running now (I only had a llama-2-70b fine-tune handy, not codellama) and so far the loss isn't nan. And perhaps you can share your environment so we can try to replicate?
@johnowhitaker Thanks for your reply.
The loss is nan from the very beginning of training. I waited for a few hours, but the loss value stayed the same. I even tried changing the learning rate, but the loss is still nan.
I'm using my private computer, so it is not possible to share that environment.
You can even try the below command on your side.
python train.py --model_name codellama/CodeLlama-70b-hf --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset sql --reentrant_checkpointing true
Thanks.
@johnowhitaker
I have changed the precision (bf16 to fp32, among other options) and the learning rate, but I still have the same issue.
If you have any suggestions then please let me know.
Thanks.
@sanipanwala can you try with llama 7B to see if it's codellama-specific or something else that's going wrong? I ran your command on the models I had handy and none gave nan losses.
"I'm using my private computer, so it is not possible to share that environment." Sorry, by 'share your environment' I meant tell us what versions of the libraries are installed, so we can see if it might be related to that. Specifically these libraries (a quick way to print the installed versions is sketched after this list):
accelerate
bitsandbytes
datasets
hqq
hqq-aten
huggingface-hub
llama-recipes
peft
safetensors
tokenizers
torch
transformers
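One way to collect those versions in a single command, a sketch that assumes a pip-based install (the pattern below simply matches the package names listed above; packages that aren't installed just won't appear):
# Print installed versions of the packages listed above, if present
pip list 2>/dev/null | grep -Ei 'accelerate|bitsandbytes|datasets|hqq|huggingface-hub|llama-recipes|peft|safetensors|tokenizers|torch|transformers'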
If you've already tried lowering the learning rate I'm not sure what else could be the issue.
Hi @johnowhitaker
Thank you for your reply.
I have successfully tried the Llama 7B model, and it's working fine. However, when I attempted to run the 70B model, the loss consistently showed as NaN.
Below are the libraries and their versions that I have installed:
accelerate 0.27.2
bitsandbytes 0.43.0
datasets 2.18.0
hqq (I haven't installed)
hqq-aten (I haven't installed)
huggingface-hub 0.21.4
llama-recipes (I haven't installed)
peft 0.9.0
safetensors 0.4.2
tokenizers 0.15.2
torch 2.2.1+cu118
transformers 4.38.2
Thanks.
Hi @johnowhitaker ,
Please let me know if you require any other information.
Thanks.
Have you solved this problem? I have encountered the same problem and am still working on it. I would like to ask how you are trying to resolve the issue. I have all the packages mentioned in the code. @sanipanwala
@hsb1995 ,
Not yet. I haven't found any solution to this issue.
I have changed the precision (bf16 to fp32, among other options) and the learning rate, but I have the same issue.
Thanks, Sani
Have you also tried the QLoRA option command?
I just tested this: models with small weights can be trained in parallel, but models with large weights cannot. This indicates that it is not an issue with our hardware or software.
Hi @hsb1995 ,
Yes, 7B is working fine, without any issue in the parallel process.
Have you also tried the QLoRA option command? => Yes, I have tried it, but I get the same issue.
Thanks, Sani
I am also working on this problem, and I have been struggling with it for two days without any clue. The blog post says the 70B model can be trained on two 3090/4090 cards, but I feel there are still many problems. One of the replies said it was an SSH connection issue. After reading all the issues, there are indeed many people who have encountered this problem but still haven't solved it, and there is no clear answer.
The command mentioned in the post requires 128 GB of CPU RAM, which is what I currently have. Is it related to this? Or can you take a look at your CPU RAM and see if it meets the requirement?
@sanipanwala
Hi @hsb1995 ,
Yes, I have 128 GB CPU RAM and the swap memory is 116 GB.
Thanks, Sani
That puts me in an awkward spot, then. Could this be the reason the run did not succeed? @sanipanwala
I feel that my failure was caused by the CPU RAM, and I tried other commands but still couldn't succeed. @johnowhitaker But it's strange that you didn't succeed either, since all your conditions are met.
@hsb1995 ,
I'm not sure, but even the 70B model is producing NaN losses during training. I think this could be due to a library version issue.
Thanks, Sani
Why do the saved weights after training contain far fewer files than the pretrained weights? How can I use the trained files for downstream tasks? Do you know? I used 13B weights.
@sanipanwala