Unexpected latency times
Hi. I'm an early adopter of Unsloth, and my recent experiments with the library produced unexpected latency results. I followed the official notebooks and got the following results while fine-tuning both Gemma 2 and Qwen 2.5:
| | load_in_4bit = True | load_in_4bit = False |
|---|---|---|
| Unsloth | 1.6 it/s, 6 GB VRAM | 1.8 it/s, 6.5 GB VRAM |
| HF | 1.2 it/s, 17.8 GB VRAM | 1.4 it/s, 18.6 GB VRAM |
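For reference, the setup is essentially the notebook recipe. A rough sketch follows; the model name, dataset, and hyperparameters are illustrative rather than my exact run, and `load_in_4bit` was toggled between runs:

```python
# Rough sketch of the fine-tuning setup, following the official Unsloth notebooks.
# Model name, dataset, and hyperparameters are illustrative, not the exact run.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-9b",   # also tried a Qwen2.5 checkpoint
    max_seq_length = 2048,
    load_in_4bit = True,                 # toggled True/False for the table above
    dtype = None,                        # auto-detect (bf16 on A100)
)

# Attach LoRA adapters via Unsloth's PEFT helper.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("timdettmers/openassistant-guanaco", split = "train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        max_steps = 60,
        bf16 = True,
        output_dir = "outputs",
    ),
)
trainer.train()   # the it/s numbers above come from this loop
```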
While the VRAM reduction compared to Hugging Face is great, I observed a couple of unexpected behaviors:
- I was expecting roughly a 2x speedup for these models, which I did not observe.
- The 4-bit quantized model was actually slower (lower it/s) than the model loaded in native precision.
Could anyone shed some light on this?
BTW: I'm using an A100 80GB GPU.
Thanks in advance
@davidjimenezphd Apologies for the delay! Our benchmarks are at https://huggingface.co/blog/unsloth-trl, which might be helpful.
For Gemma 2 you should enable Flash Attention 2 to speed things up (Unsloth should have printed a warning about this).
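For the HF baseline, Flash Attention 2 can be requested explicitly when loading the model, assuming the flash-attn package is installed (Unsloth picks it up automatically when available). A minimal sketch:

```python
# Sketch: load the HF baseline with Flash Attention 2 enabled
# (requires the flash-attn package to be installed).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",                      # placeholder model name
    torch_dtype = torch.bfloat16,
    attn_implementation = "flash_attention_2",
)
```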
Yes, 4-bit will be slower since there are extra dequantization steps. Would you be able to share how you ran the original HF experiments? Did you use prepare_model_for_kbit_training?
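For comparison, a typical 4-bit HF + PEFT baseline looks roughly like the sketch below (model name and LoRA settings are placeholders). Skipping prepare_model_for_kbit_training or gradient checkpointing can change both VRAM usage and step time noticeably, which would skew the comparison:

```python
# Sketch of a typical 4-bit HF + PEFT baseline for comparison.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_compute_dtype = torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",                      # placeholder model name
    quantization_config = bnb_config,
)

# Casts norms / embeddings appropriately and enables gradient checkpointing.
model = prepare_model_for_kbit_training(model)

model = get_peft_model(model, LoraConfig(
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type = "CAUSAL_LM",
))
```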