
dtype float16 for finetune_lora.py

Open usmanxia opened this issue 2 years ago • 4 comments

Hi!

I am trying to run finetune_lora.py on a machine with a V100 32 GB GPU. However, I get this error when I run the fine-tuning:

RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.

Can someone please guide me on how to force the model to fine-tune using float16 instead of bfloat16, since the V100 doesn't support bfloat16?

Thank you

usmanxia avatar May 07 '23 15:05 usmanxia

@usmanxia Change the Fabric precision from

fabric = L.Fabric(accelerator="cuda", devices=1, precision="bf16-true")

to

fabric = L.Fabric(accelerator="cuda", devices=1, precision="16-mixed")
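If you want the script to pick a working precision automatically instead of hard-coding it, a minimal sketch could look like this. `pick_precision` is a hypothetical helper (not part of lit-llama), relying on `torch.cuda.is_bf16_supported()` from recent PyTorch:

```python
import torch

def pick_precision() -> str:
    # Hypothetical helper: prefer bf16 on GPUs that support it (Ampere
    # and newer), and fall back to float16 mixed precision on older
    # cards such as the V100 (Volta).
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return "bf16-true"
    return "16-mixed"
```

The returned string can then be passed straight to `L.Fabric(accelerator="cuda", devices=1, precision=pick_precision())`.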

ArturK-85 avatar May 07 '23 15:05 ArturK-85

Hello, I am also on a V100 but get OOM. The default is

batch_size = 128

and I changed it to

batch_size = 6
micro_batch_size = 2
gradient_accumulation_steps = batch_size // micro_batch_size
max_iters = 50000 * 3 // micro_batch_size

yet it still runs out of memory even with such a small batch size.
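For anyone tuning these numbers: with gradient accumulation, only `micro_batch_size` drives peak activation memory; `batch_size` is reached by accumulating gradients over several micro-batches before each optimizer step. A small sketch of the arithmetic above (the helper name is mine, not from the repo):

```python
def accumulation_steps(batch_size: int, micro_batch_size: int) -> int:
    # The effective batch is micro_batch_size * accumulation steps, so
    # lowering micro_batch_size cuts memory without changing the
    # effective batch size the optimizer sees.
    assert batch_size % micro_batch_size == 0
    return batch_size // micro_batch_size

print(accumulation_steps(6, 2))    # 3 accumulation steps
print(accumulation_steps(128, 2))  # 64 accumulation steps
```

So if memory is still tight, try `micro_batch_size = 1`; the gradient accumulation keeps the effective batch size intact.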

lucasjinreal avatar May 18 '23 02:05 lucasjinreal

Can someone please guide me on how to force the model to fine-tune using float16 instead of bfloat16, since the V100 doesn't support bfloat16?

Hi @usmanxia, you can install the recent developer version of Lightning via

pip install git+https://github.com/Lightning-AI/lightning@master

and then switch "bf16-full" to "16-full" (regular float16) here:

https://github.com/Lightning-AI/lit-llama/blob/main/finetune/lora.py#L53

But I think there is currently no good alternative to bf16. I tried "16-full" (regular float16) but the model wouldn't converge (eventually resulting in NaNs in the loss). I think the dynamic range of float16 is too small for this type of model.
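The range issue is easy to reproduce: float16 overflows to infinity just above 65504, while bfloat16 keeps float32's 8-bit exponent and represents such magnitudes without trouble (a small standalone demo, not code from the repo):

```python
import torch

x = torch.tensor(1e5)  # well within bfloat16's range, beyond float16's max (~65504)
print(x.to(torch.float16))   # overflows to inf
print(x.to(torch.bfloat16))  # stays finite, at reduced mantissa precision
```

Once an activation or gradient overflows like this, the inf propagates through the loss and turns into NaNs, which matches the divergence described above.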

And I tried 16/32 mixed precision, but that consumed too much memory.

But yeah, could be that a different hyperparameter config might work for "16-full". If you have time to experiment with it @usmanxia and find a config that works, pls share!

rasbt avatar May 18 '23 20:05 rasbt

https://github.com/Lightning-AI/lit-gpt/blob/96d66b4845ebe287b5dd57b45e584b38d4f660e7/lit_gpt/speed_monitor.py#L17-L57 lists the training precisions supported by the V100.

JerryDaHeLian avatar Nov 17 '23 06:11 JerryDaHeLian