OutOfMemory error when finetuning Falcon7b on custom dataset

Open someshfengde opened this issue 2 years ago • 4 comments

Hardware used

2x A100 (also ran the experiment with a single A100, since I thought the issue might be due to the distributed setup)

Adapters used

I've tried adapter.py and adapter_v2.py; both fail with an out-of-memory error, while the same setup works with lora.py.
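As a quick sanity check on where the memory goes, one can compare trainable vs. total parameter counts after applying each method. The sketch below is plain PyTorch with a toy model standing in for the lit-gpt GPT classes, so treat it as illustrative only:

```python
from torch import nn

def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# Toy stand-in: freeze the "base" layer, leave the small "adapter-like" layer trainable.
model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 8))
for p in model[0].parameters():
    p.requires_grad = False

trainable, total = count_params(model)
print(f"trainable: {trainable:,} / total: {total:,}")
```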

someshfengde avatar Jun 28 '23 10:06 someshfengde

Hi. Distributed LoRA is not implemented. How did you train with LoRA on 2 A100s?

sylviachency avatar Jun 29 '23 23:06 sylviachency

I used device=2 and strategy='deepspeed'; it was training on both A100s but caused an OOM error.
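For reference, lit-gpt's finetune scripts drive multi-GPU training through Lightning Fabric, so the devices/strategy combination described above corresponds roughly to the sketch below (the accelerator and precision values are assumptions, not necessarily what the script uses):

```python
# Minimal Lightning Fabric sketch of the reported setup; lit-gpt wires this up
# inside its finetune scripts, and the precision choice here is an assumption.
from lightning.fabric import Fabric

fabric = Fabric(
    accelerator="cuda",
    devices=2,               # the two A100s
    strategy="deepspeed",    # DeepSpeed ZeRO shards optimizer state across GPUs
    precision="bf16-mixed",
)
fabric.launch()
print(f"rank {fabric.global_rank} of {fabric.world_size}")
```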

someshfengde avatar Jun 30 '23 06:06 someshfengde

Did you see memory occupation on both GPUs? If you look into the source code for LoRA, there's a NotImplementedError when devices > 1. So to my understanding, LoRA in this repo does not currently support multi-GPU training.
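The kind of guard being referred to would look something like this (illustrative only; the exact wording and location in lora.py may differ):

```python
def setup(devices: int = 1) -> None:
    # Illustrative guard, not the verbatim lit-gpt code: multi-GPU LoRA is
    # rejected up front rather than failing later during training.
    if devices > 1:
        raise NotImplementedError("LoRA finetuning does not support devices > 1 yet.")
    ...  # the rest of the finetuning setup would go here
```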

sylviachency avatar Jun 30 '23 06:06 sylviachency

Yes, I saw memory occupation on both GPUs; strategy='deepspeed' made that work, I guess.

someshfengde avatar Jun 30 '23 06:06 someshfengde

We have a guide for dealing with OOMs here: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md
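Common OOM mitigations include shrinking the micro batch size and accumulating gradients to keep the effective batch size. A toy PyTorch sketch of that pattern (not lit-gpt code; all names here are made up):

```python
import torch
from torch import nn

# Toy stand-ins; the real scripts build a GPT model and an instruction dataset.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 16).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

micro_batch_size = 2    # as small as still fits in memory
grad_accum_iters = 8    # effective batch size = 2 * 8 = 16

optimizer.zero_grad()
for step in range(64):
    x = torch.randn(micro_batch_size, 16, device=device)
    y = torch.randn(micro_batch_size, 16, device=device)
    loss = nn.functional.mse_loss(model(x), y) / grad_accum_iters
    loss.backward()  # gradients accumulate across micro batches
    if (step + 1) % grad_accum_iters == 0:
        optimizer.step()
        optimizer.zero_grad()
```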

carmocca avatar Aug 14 '23 12:08 carmocca