
[FEATURE REQUEST] does autotrain support multiple GPUs?

Open mrticker opened this issue 1 year ago • 13 comments

Feature Request

I wanted to run a model with a 65k block size on an A100 with 80 GB and ran out of memory:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.48 GiB. GPU 0 has a total capacty of 79.18 GiB of which 8.86 GiB is free. Process 142839 has 70.33 GiB memory in use. Of the allocated memory 69.14 GiB is allocated by PyTorch, and 158.49 MiB is reserved by PyTorch but unallocated.

Since I was missing 31 GB, I thought maybe adding a second GPU would help. But no, I get basically the same message:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 31.48 GiB. GPU 0 has a total capacty of 79.18 GiB of which 8.53 GiB is free. Process 100838 has 70.65 GiB memory in use. Of the allocated memory 69.17 GiB is allocated by PyTorch, and 389.00 MiB is reserved by PyTorch but unallocated.

Using two GPUs resulted in some messages being doubled like this:

The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.
The model was loaded with use_flash_attention_2=True, which is deprecated and may be removed in a future release. Please use `attn_implementation="flash_attention_2"` instead.

It looks like autotrain doesn't use the additional GPU. Does it support multiple GPUs? If so, how do I activate this feature?

Motivation

Out of memory on a single GPU.

Additional Context

No response

mrticker avatar Jan 25 '24 20:01 mrticker

@abhishekkrthakur, multi-GPU support in autotrain-advanced doesn't seem to work. Could you please advise?

Setting this in my job script:

export CUDA_VISIBLE_DEVICES=0,1

does not seem to split the load across the 2 GPUs (I did check that there are indeed 2 GPUs). It seems like it is simply running twice, because my error log shows the error twice.

For example,

0%|          | 0/48 [00:30<?, ?it/s

will appear twice too. The error logs for 1 GPU and 2 GPUs are exactly the same, which makes me believe that the extra GPU is simply being loaded with the same thing.
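One generic way to check whether the second GPU is actually doing any work is to watch per-GPU memory and utilization in a second terminal while the job runs (plain nvidia-smi, nothing autotrain-specific; the 5-second refresh interval is just an example):

# refresh every 5 seconds; with data-parallel training both GPUs should show memory in use and non-zero utilization
nvidia-smi --query-gpu=index,memory.used,utilization.gpu --format=csv -l 5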

Could you kindly check on this? Am I missing some config or something?

jackswl avatar Jan 28 '24 05:01 jackswl

Any updates on this? Running on multiple GPUs doesn't seem to be supported, despite the documentation stating that it is. It just loads the same task onto each GPU.

jackswl avatar Feb 01 '24 15:02 jackswl

It works out of the box. Did you manually change the accelerate config?

abhishekkrthakur avatar Feb 01 '24 15:02 abhishekkrthakur

Could you kindly let me know how exactly to do that? This is not written anywhere in the documentation for autotrain-advanced, and it is unclear how I should run autotrain-advanced on 2 GPUs.

After setting up the accelerate config, do I still use export CUDA_VISIBLE_DEVICES=0,1? Do I have to use accelerate launch? If yes, how does it work: accelerate launch autotrain llm ....? And how do I set all of this up in a job script? I am not using local GPUs; I am running this through an HPC.

jackswl avatar Feb 02 '24 00:02 jackswl

I think there's been a misunderstanding.

Could you kindly let me know how exactly to do that? This is not written anywhere in the documentation for autotrain-advanced, and it is unclear how I should run autotrain-advanced on 2 GPUs.

You DO NOT need to do anything to make autotrain run on multiple GPUs. It runs on multiple GPUs by default. See the commands here.

In case you want to run accelerate config manually, you can, but again, you don't need to use the accelerate launch command; you can just use the same old autotrain command.

In multi-GPU mode (the default mode), some logs may appear N times (N = number of GPUs), and that is perfectly normal. If you feel like it's not using multiple GPUs, please provide more context.
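For illustration, on a two-GPU node the whole sequence might look roughly like this (the model name, data path, and project name below are placeholders for the sketch, not taken from this thread, and flag names can differ between autotrain-advanced versions):

# optional: answer the prompts for multi-GPU (2 processes, 1 machine, no DeepSpeed)
accelerate config

# make both devices visible to the job
export CUDA_VISIBLE_DEVICES=0,1

# then launch with the regular autotrain command; no accelerate launch needed
autotrain llm --train \
  --model mistralai/Mistral-7B-v0.1 \
  --data-path ./data \
  --project-name multi-gpu-test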

abhishekkrthakur avatar Feb 02 '24 10:02 abhishekkrthakur

If you feel like it's not using multiple GPUs, please provide more context.

As I mentioned in the first message, the out-of-memory errors are basically identical whether I use one GPU or two.

To confirm it, I established what block size I can fit on one GPU, increased it slightly (by 2k tokens), and ran autotrain on two GPUs. It ran out of memory exactly as it did with that slightly increased block size on one GPU.

TLDR:
can run X block size on one GPU
OOM when running X + 2k on one GPU
OOM when running X + 2k on two GPUs
conclusion: autotrain does not use the second GPU

mrticker avatar Feb 02 '24 18:02 mrticker

Could you run 'accelerate config', answer the questions, and then run the autotrain command to see if that fixes your issue?
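For reference, a minimal sketch of what the resulting config file might contain for a straightforward two-GPU data-parallel setup; the path is accelerate's default location, and the exact keys vary with the accelerate version:

# accelerate writes its answers to this file by default
cat ~/.cache/huggingface/accelerate/default_config.yaml
# compute_environment: LOCAL_MACHINE
# distributed_type: MULTI_GPU
# num_machines: 1
# num_processes: 2
# gpu_ids: all
# mixed_precision: fp16
# use_cpu: false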

abhishekkrthakur avatar Feb 02 '24 18:02 abhishekkrthakur

Does not work. @mrticker, please confirm too.

jackswl avatar Feb 05 '24 13:02 jackswl

TLDR:
can run X block size on one GPU
OOM when running X + 2k on one GPU
OOM when running X + 2k on two GPUs
conclusion: autotrain does not use the second GPU

Unfortunately, I don't agree with this. Multiple GPUs work this way:

  • can run X block size on one GPU with batch size bs
  • can run X block size on two GPUs with a larger bs

or

  • can run a 7B model on one GPU
  • can run a 13B model on multiple GPUs

(Under the default data-parallel setup each GPU holds a full copy of the model, so per-GPU memory does not shrink; fitting a larger model or block size across GPUs requires sharding the model instead.) We have been constantly testing multiple GPUs with autotrain and it has always worked fine. What kind of GPUs do you have?

abhishekkrthakur avatar Feb 06 '24 06:02 abhishekkrthakur

Also adding: I'm able to fine-tune the Mixtral 8x7B model on 8x A100 using autotrain, which would not be possible without using multiple GPUs :)

abhishekkrthakur avatar Feb 06 '24 07:02 abhishekkrthakur

I assume you simply use

export CUDA_VISIBLE_DEVICES=0,1,2,3,4....

at the start of your job script? That's all that is required, right? Otherwise, could you kindly let me know what exactly you did to make use of the 8 GPUs?

jackswl avatar Feb 13 '24 04:02 jackswl

@abhishekkrthakur, if an 80GB A100 works without OOM, does that mean 2x 40GB A100s will work too? Because currently an 80GB A100 works for me, but 2x 40GB A100s OOM very fast. Not sure why this is happening.

Does it mean that if 1x 40GB OOMs with bs=4, then 2x 40GB will also OOM with bs=4? Correct?

So the way multi-GPU should be used is: if 1x 40GB does NOT OOM with bs=2, then I can use 2x 40GB with bs=4? But wouldn't that be the same as the above?

jackswl avatar Feb 14 '24 14:02 jackswl

Sir, in my case I have one instance with 5 A100 40GB GPUs available. Could you help me with how I could use all 5 GPUs?

pip install -U autotrain-advanced

autotrain setup --update-torch

autotrain dreambooth \
  --model stabilityai/stable-diffusion-xl-base-1.0 \
  --project-name dog \
  --image-path images/ \
  --prompt "photo of sks dog" \
  --resolution 1024 \
  --batch-size 3 \
  --num-steps 500 \
  --fp16 \
  --gradient-accumulation 4 \
  --lr 1e-4

I was just using it like this and getting CUDA out of memory; it's only using the 1st GPU, which is at index 0. Thank you.
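As a generic sanity check (plain PyTorch, not autotrain-specific), it can help to confirm how many GPUs the job can actually see before launching:

# should print 5 if all five A100s are visible to the process
python -c "import torch; print(torch.cuda.device_count())"

If this prints 1, the scheduler or CUDA_VISIBLE_DEVICES setting is restricting the job to a single device before autotrain ever starts.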

Sandeep-Narahari avatar Feb 16 '24 23:02 Sandeep-Narahari

This issue is stale because it has been open for 15 days with no activity.

github-actions[bot] avatar Mar 08 '24 15:03 github-actions[bot]

This issue was closed because it has been inactive for 2 days since being marked as stale.

github-actions[bot] avatar Mar 19 '24 15:03 github-actions[bot]