OutOfMemory error when finetuning Falcon-7B on a custom dataset
Hardware used
2x A100. I also repeated the experiment on a single A100, since I thought the error might be caused by the distributed setup.
Adapters used
I've tried both adapter.py and adapter_v2.py, and both fail with an out-of-memory error, while the same setup works with lora.py.
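For reference, a rough way to compare per-GPU memory usage between the adapter and LoRA runs might look like the snippet below. This is illustrative and not taken from the repo; it just reads PyTorch's CUDA memory counters during finetuning.

```python
import torch

# Print current and peak memory on each visible GPU (assumes the two A100s
# are CUDA devices 0 and 1). Call this periodically inside the training loop.
def report_gpu_memory() -> None:
    for device_id in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(device_id) / 1e9
        peak = torch.cuda.max_memory_allocated(device_id) / 1e9
        print(f"GPU {device_id}: {allocated:.1f} GB allocated, {peak:.1f} GB peak")
```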
Hi. Distributed LoRA is not implemented. How did you train with LoRA on 2 A100s?
I used device=2 and strategy='deepspeed'. It was training on 2x A100 but kept hitting the OOM error.
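For context, a minimal sketch of that kind of multi-GPU launch with Lightning Fabric is shown below. The devices/strategy values are the ones mentioned in this comment; everything else is illustrative and not the actual finetune script.

```python
from lightning.fabric import Fabric

# Two processes, one per A100, sharded via DeepSpeed, bf16 weights/activations.
fabric = Fabric(devices=2, strategy="deepspeed", precision="bf16-true")
fabric.launch()
fabric.print(f"Running on {fabric.world_size} processes")
```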
Did you see memory occupation on both GPUs? If you look into the source code for LoRA, there's a NotImplementedError raised when devices > 1. So, to my understanding, LoRA does not currently support multi-GPU training in this repo.
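The kind of guard being referred to would look roughly like this (illustrative only, not a verbatim quote of finetune/lora.py):

```python
devices = 2  # hypothetical value a user might set

# Reject multi-GPU runs up front, since distributed LoRA is not supported here.
if devices > 1:
    raise NotImplementedError(
        "LoRA finetuning does not yet support multiple devices in this script."
    )
```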
Yes, I saw memory occupation on both GPUs. strategy='deepspeed' is what made it run on both, I guess.
We have a guide for dealing with OOMs here: https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/oom.md
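As a starting point, the usual memory-saving knobs look something like the sketch below. The variable names here are hypothetical; the linked oom.md guide lists the options that actually apply in lit-gpt.

```python
# Illustrative memory-saving settings for a finetuning run.
batch_size = 64                 # effective batch size kept constant
micro_batch_size = 1            # smallest per-step batch to cut activation memory
gradient_accumulation_iters = batch_size // micro_batch_size
precision = "bf16-true"         # half-precision weights and activations
max_seq_length = 512            # truncate long samples to bound activation size
```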