Finetune Falcon-40B with adapter_v2.py using 8 A100 80GB GPUs
Has anyone finetuned Falcon-40B with adapter_v2 using 8 A100 GPUs?
lora.py doesn't support multiple GPUs for now.
I tried Falcon-7B with adapter_v2 using 8 GPUs and it worked, but not for 40B.
I will try it. Which hyperparameters are you using?
The same parameters as in adapter_v2.py (main branch), with the Alpaca dataset. I just tried 40B and ran out of memory.
I tried using LoRA and ran into an OOM issue with the same machine configuration (8 A100 80GB GPUs).
Just curious: would the model even fit on a single machine with this configuration?
In the Lightning blog, they claimed the ability to finetune Falcon-40B, but they only provided an example using the 7B model.
Previously, I attempted to train the 40B model with Hugging Face's framework, using 8-bit quantization and LoRA on 8 A100 GPUs. However, I ran into difficulties training the unquantized version of the model, which led me to switch to Lit-Parrot. It appears that no one has successfully finetuned a 40B model without quantization on a single machine (8 A100 80GB). I hope this clarifies the situation.
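For reference, here is a minimal sketch of the Hugging Face route described above (8-bit quantization plus LoRA via peft and bitsandbytes). The LoRA hyperparameters below are just for illustration, not my exact settings:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM

# Load Falcon-40B with bitsandbytes int8 weights, spread across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    load_in_8bit=True,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA config; r/alpha/dropout are assumptions.
# "query_key_value" is Falcon's fused attention projection.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the LoRA weights remain trainable
```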
How did you run LoRA distributed? I changed the device count and ran into a NotImplementedError.
I just checked the code; it's not implemented for multiple GPUs. I guess he used a previous version, not the main branch.
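(For context, the error comes from an explicit guard, something like the sketch below; this is an illustration rather than the exact lit-gpt source:)

```python
# Illustrative guard, not the exact lit-gpt code: the script is written
# for a single device, and any other device count is rejected outright.
devices = 8  # changing this from the default of 1 triggers the error

if devices > 1:
    raise NotImplementedError("lora.py does not support multi-GPU training yet")
```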
Are there previous versions that support multi-GPU?
I think so; they used DeepSpeed previously. Later, they changed to FSDP.
Sorry, I was out for the last couple of days. Yes, previous versions tried to support multiple GPUs (through FSDP and DeepSpeed), but none of them worked, so it's been reverted to "not implemented".
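For anyone who wants to experiment anyway, a sharded setup with Lightning Fabric's FSDP strategy might look roughly like this. This is a sketch assuming Lightning 2.x and the lit_gpt.model.Block class, not a supported path in the current finetuning scripts:

```python
from functools import partial

import lightning as L
from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

from lit_gpt.model import Block  # the transformer block to shard around

# Wrap each transformer block in its own FSDP unit and recompute its
# activations in the backward pass to save memory.
policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
strategy = FSDPStrategy(auto_wrap_policy=policy, activation_checkpointing=Block)

fabric = L.Fabric(devices=8, precision="bf16-true", strategy=strategy)
fabric.launch()
```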
Thank you. Do you have any plans to implement multi-GPU support?
@carmocca Any comment?
LoRA distributed support is tracked in #161
Regarding training Falcon-40B on 8 A100 80GB GPUs: I don't have access to that hardware, but you can try the suggestions in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md. You'll need to use sharding, as Falcon-40B doesn't fit in 80 GB. Additionally, adapter_v2 has more trainable parameters than adapter, so you might prefer the latter for such a large model.
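For intuition on why sharding is unavoidable, a back-of-the-envelope estimate:

```python
# Falcon-40B weights alone, in 16-bit precision:
n_params = 40e9        # ~40 billion parameters
bytes_per_param = 2    # bf16/fp16
print(n_params * bytes_per_param / 1e9)  # ~80 GB: a full A100 80GB
# Activations, adapter gradients, and optimizer state come on top of that,
# so the weights have to be sharded across the 8 GPUs.
```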
Where is the cpu_offload parameter located in the codebase? I didn't find it.
Here (for example)
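(In other words, it's an argument to the FSDP strategy. A minimal illustration, assuming Lightning 2.x:)

```python
from lightning.fabric.strategies import FSDPStrategy

# Offload sharded parameters to CPU RAM between uses: slower, but it
# frees GPU memory for a model of this size.
strategy = FSDPStrategy(cpu_offload=True)
```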