
ProcessExitedException: process 0 (2x 4090)

Pugio opened this issue 11 months ago • 39 comments

I'm trying what looks like the "Hello World" of this repo: running the basic training on a Runpod community cloud 2 x RTX 4090 (128 vCPU, 125 GB RAM) configuration. Normally I'd play around with this for longer before posting an issue, but since Runpod was mentioned explicitly in the Answer.ai intro post, I figure this will be the simplest path for anybody trying to test this out.

On their runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04 pod:

python train.py \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 2 \
--context_length 2048 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--reentrant_checkpointing true \
--log_to wandb

It downloads the Llama-2 model, sets everything up, and dies with the following backtrace:

Traceback (most recent call last):                                                                           
  File "/root/fsdp_qlora/train.py", line 939, in <module>                                                    
    def main(                                                                                                
  File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 125, in call_parse                 
    return _f()                                                                                              
  File "/usr/local/lib/python3.10/dist-packages/fastcore/script.py", line 119, in _f                         
    return tfunc(**merge(args, args_from_prog(func, xtra)))                                                  
  File "/root/fsdp_qlora/train.py", line 1010, in main                                                       
    mp.spawn(fsdp_main,                                                                                      
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 241, in spawn          
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")                             
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():                                                                                
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/spawn.py", line 140, in join           
    raise ProcessExitedException(                                                                            
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL  

Log:

Creating model 0
Loading model 0
Model created 0 1.119 GB
trainable params: 744,488,960 || all params: 35,495,616,512 || trainable%: 2.097410985236193
Wrapping model w/ FSDP 0
Wrapped model 0 1.444 GB
Applying activation checkpointing 0
Total Training Steps: 12940
Epoch 0, Loss 0.000:   0%| | 0/12940 [00:00<?, ?it/s]

Here's the W&B run.

I haven't found any indicators as to what's going on. Both system and GPU RAM seem well within bounds, so I'm not sure why it's dying (unless maybe 125 GB of system RAM is not enough, and it's getting blown through instantaneously, before it's visible on nvitop or in the W&B log?)
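One way to catch a spike that is too fast for nvitop is to log system memory once per second while the run starts. A minimal sketch, assuming a Linux host with the procps free command available (the log filename is arbitrary):

# Log system RAM every second so a short-lived spike is still captured
free -h -s 1 | tee ram_usage.log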

Pugio avatar Mar 10 '24 04:03 Pugio

I suppose you are using SSH to log in and run the training. The likely issue there is that when your SSH session disconnects, it terminates the processes you started. The way to keep it running even after you disconnect is to use something like tmux.

I ran into a similar issue, and here is what I did:

tmux new -s session_name
# run all the commands

When you SSH in again and need to attach to the same session, you can do the following:

# Confirm that your session is still alive
tmux ls

tmux attach -t session_name

When you want to detach from the tmux session, you can use Ctrl+b then d.

P.S.: Something like nohup with & at the end of the command doesn't work (in my experience), as torch has its own termination handler.

zabirauf avatar Mar 10 '24 22:03 zabirauf

I appreciate the thoughtful response, but in this case I don't think that's what's happening. Firstly, I'm remaining connected with a live session (remember, this crashes before it even runs one training step). I'm also running this inside a screen session, which has the same basic behavior as tmux.

Hopefully I'll have a chance to work more on debugging this today.

Pugio avatar Mar 10 '24 22:03 Pugio

@zabirauf

I'm also facing the same issue.

torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL

Also, I run my script using the screen command.

Thanks.

sanipanwala avatar Mar 11 '24 03:03 sanipanwala

I think it would be helpful to figure out how much system RAM is needed for something like Llama 70B. The basic 2x4090 Runpod gives you 125 GB. When I tried the same command on a 4x4090 with 251 GB RAM, the training started without a hitch. I don't think this is due to the extra 2 GPUs, because each GPU is only at about 40% memory usage.

I might try another test using the same 251 GB RAM machine, but using only 2 of the GPUs to confirm this.
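Since the workers die with SIGKILL, it is also worth checking whether the kernel OOM killer is the culprit after a crash. A minimal sketch (dmesg usually needs root, and inside some containers the host kernel log isn't visible):

# Look for OOM-killer activity in the kernel log after the crash
sudo dmesg -T | grep -iE "out of memory|killed process"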

Pugio avatar Mar 11 '24 08:03 Pugio

I am trying to run this project on 2x4090 and found that the same issue occurred. My error message was torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL

When I changed my RAM from 64 GB to 200 GB, the repository's train.py worked fine. I think it needs much more RAM than described in README.md, maybe somewhere between 128 GB and 200 GB.

SangJunni avatar Mar 11 '24 08:03 SangJunni

@SangJunni I have 190 GB RAM but still the same issue.

Thanks.

sanipanwala avatar Mar 11 '24 09:03 sanipanwala

@sanipanwala Can you allocate more than 200 GB of RAM and then try running train.py? I am working on this project on a private server over an SSH connection.

SangJunni avatar Mar 11 '24 09:03 SangJunni

@SangJunni I also used a private server and I think 190 GB is enough to run this program.

Thanks.

sanipanwala avatar Mar 11 '24 09:03 sanipanwala

Running the example with a 70B model on 2 GPUs currently uses just over 128GB RAM, so systems with less than that will have issues. I've updated the readme to add a note on this and one potential fix: on my machine (128GB CPU RAM + 2x3090) I create a 10GB swapfile and this is enough to handle the RAM spike during loading before the training starts and it settles into a lower steady-state usage.
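For anyone wanting to try the swapfile workaround, a minimal sketch on Linux (the path and size are illustrative; this needs root, and it may not be possible inside some container environments):

# Create and enable a 10 GB swapfile to absorb the loading-time RAM spike
sudo fallocate -l 10G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Confirm the swapfile is active
swapon --show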

johnowhitaker avatar Mar 11 '24 20:03 johnowhitaker

That makes perfect sense, thank you. Maybe I'll put that into a gist for starting this up on Runpod.

Pugio avatar Mar 11 '24 20:03 Pugio

@johnowhitaker ,

I'm using the below command.

python train.py --model_name codellama/CodeLlama-70b-hf --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset sql --reentrant_checkpointing true

Do you know why the loss is always NaN? I'm using the "knowrohit07/know_sql" dataset.

Thanks.

sanipanwala avatar Mar 12 '24 07:03 sanipanwala

@sanipanwala how soon does the issue occur? I have that running now (I only had a llama-2-70b fine-tune handy, not codellama) and so far the loss isn't NaN. And perhaps you can share your environment so we can try to replicate it?

johnowhitaker avatar Mar 12 '24 23:03 johnowhitaker

@johnowhitaker Thanks for your reply.

The loss is NaN right from the start of training. I waited for a few hours but the loss value stayed the same. I even tried changing the learning rate, but the loss is still the same.

I'm using my private computer, so it is not possible to share that environment.

You can try the command below on your side.

python train.py --model_name codellama/CodeLlama-70b-hf --batch_size 2 --context_length 512 --precision bf16 --train_type qlora --use_gradient_checkpointing true --use_cpu_offload true --dataset sql --reentrant_checkpointing true

Thanks.

sanipanwala avatar Mar 13 '24 07:03 sanipanwala

@johnowhitaker

I have changed the precision (bf16 to fp32 and other options) and the learning rate, but I still have the same issue.

If you have any suggestions then please let me know.

Thanks.

sanipanwala avatar Mar 15 '24 03:03 sanipanwala

@sanipanwala can you try with llama 7B to see if it's codellama-specific or something else that's going wrong? I ran your command on the models I had handy and none gave NaN losses.

"I'm using my private computer, so it is not possible to share that environment."

Sorry, by "share your environment" I meant: tell us what versions of the libraries are installed, so we can see if it might be related to that. Specifically these libraries (one way to dump their versions is sketched below the list):

accelerate                
bitsandbytes            
datasets                  
hqq                       
hqq-aten              
huggingface-hub 
llama-recipes       
peft                      
safetensors         
tokenizers           
torch                    
transformers       
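
For reference, a quick way to collect just these versions is something like the following (a minimal sketch; adjust the pattern to your environment):

# Print installed versions of the relevant packages
pip list | grep -Ei "accelerate|bitsandbytes|datasets|hqq|huggingface-hub|llama-recipes|peft|safetensors|tokenizers|torch|transformers"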

If you've already tried lowering the learning rate, I'm not sure what else could be the issue.

johnowhitaker avatar Mar 15 '24 15:03 johnowhitaker

Hi @johnowhitaker

Thank you for your reply.

I have successfully tried the Llama 7B model, and it's working fine. However, when I attempted to run the 70B model, the loss consistently showed as NaN.

Below are the libraries and their versions that I have installed:

accelerate 0.27.2
bitsandbytes 0.43.0
datasets 2.18.0
hqq (not installed)
hqq-aten (not installed)
huggingface-hub 0.21.4
llama-recipes (not installed)
peft 0.9.0
safetensors 0.4.2
tokenizers 0.15.2
torch 2.2.1+cu118
transformers 4.38.2

Thanks.

sanipanwala avatar Mar 15 '24 15:03 sanipanwala

Hi @johnowhitaker ,

Please let me know if you require any other information.

Thanks.

sanipanwala avatar Mar 21 '24 07:03 sanipanwala

Have you solved this problem? I have encountered the same problem and am still working on it. I would like to ask how you resolved the issue. I have all the packages mentioned in the code installed. @sanipanwala

hsb1995 avatar Apr 09 '24 00:04 hsb1995

@hsb1995 ,

Not yet. I haven't found any solution to this issue.

I have changed the precision (bf16 to fp32 and other options) and the learning rate, but I still have the same issue.

Thanks, Sani

sanipanwala avatar Apr 09 '24 01:04 sanipanwala

Have you also tried the qlora option command?

hsb1995 avatar Apr 09 '24 02:04 hsb1995

I just tested this: files with small weights can be computed in parallel, but files with large weights cannot. This indicates that it is not an issue with our hardware or software.

hsb1995 avatar Apr 09 '24 03:04 hsb1995

Hi @hsb1995 ,

Yes, 7B is working fine in parallel without any issue.

Have you also tried the qlora option command? => Yes, I have tried it, but I get the same issue.

Thanks, Sani

sanipanwala avatar Apr 09 '24 08:04 sanipanwala

I am also working on this problem, and I have been struggling with it for two days without any clue. The post says that the 70B model can be trained on two 3090/4090 cards, but I feel there are still many problems. The author mentioned in a reply that it was an SSH connection issue. Reading through the issues, there are indeed many people who have encountered this problem but still haven't solved it, and the author hasn't clarified it either.

hsb1995 avatar Apr 09 '24 08:04 hsb1995

The command mentioned in the article requires 128 GB of CPU RAM, which is what I currently have. Is the failure related to this? Can you check your CPU RAM and see if it meets the requirement?

hsb1995 avatar Apr 09 '24 08:04 hsb1995

@sanipanwala

hsb1995 avatar Apr 09 '24 08:04 hsb1995

Hi @hsb1995 ,

Yes, I have 128 GB CPU RAM and the swap memory is 116 GB.
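If it helps to compare setups, the configured RAM and swap can be confirmed in one place (a minimal sketch):

# Show total RAM, swap, and current usage
free -h
swapon --show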

Thanks, Sani

sanipanwala avatar Apr 09 '24 09:04 sanipanwala

Is my situation a bit awkward here, then? Is that the reason the run did not succeed? @sanipanwala

hsb1995 avatar Apr 09 '24 09:04 hsb1995

I feel that my failure was caused by insufficient CPU RAM, and I tried other commands but still couldn't succeed. @johnowhitaker But it's strange that you didn't succeed either, since all your conditions are met.

hsb1995 avatar Apr 09 '24 09:04 hsb1995

@hsb1995 ,

I'm not sure, but even the 70B model is producing NaN losses during training. I think this could be due to a library version issue.

Thanks, Sani

sanipanwala avatar Apr 09 '24 09:04 sanipanwala

Why are there many fewer files in the weights saved after training than in the pre-trained weights? How can I use the trained files for downstream tasks? Do you know? I used the 13B weights. @sanipanwala

hsb1995 avatar Apr 16 '24 01:04 hsb1995