
Unexpected latency times

Open davidjimenezphd opened this issue 1 year ago • 1 comment

Hi. I'm an early adopter of unsloth, and my recent experiments with the library produced unexpected latency results. I followed the official notebooks and got the following results while fine-tuning both Gemma 2 and Qwen 2.5:

|         | `load_in_4bit = True`  | `load_in_4bit = False` |
|---------|------------------------|------------------------|
| unsloth | 1.6 it/s, 6 GB VRAM    | 1.8 it/s, 6.5 GB VRAM  |
| HF      | 1.2 it/s, 17.8 GB VRAM | 1.4 it/s, 18.6 GB VRAM |
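
For reference, both unsloth runs followed the official notebook pattern; a minimal sketch is below (the model name and LoRA hyperparameters are placeholders, not my exact settings):

```python
# Minimal sketch of the unsloth setup (placeholder model and hyperparameters).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",  # placeholder model name
    max_seq_length=2048,
    load_in_4bit=True,  # toggled True/False for the two columns above
)

# Attach LoRA adapters as in the notebooks.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)
```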

While the VRAM reduction is great compared to HuggingFace, I observed a couple of unexpected behaviors:

  1. I was expecting roughly a 2x speedup (i.e., about half the per-iteration latency) for these models, but the table shows only about 1.3x over HF.
  2. Even more surprisingly, the 4-bit quantized model ran slower (fewer it/s) than the model at native precision.

Could anyone shed some light on this?

BTW: I'm using an A100 80GB GPU.

Thanks in advance

davidjimenezphd avatar Sep 27 '24 11:09 davidjimenezphd

@davidjimenezphd Apologies for the delay! Our benchmarks are at https://huggingface.co/blog/unsloth-trl which might be helpful.

For Gemma 2, you should enable Flash Attention 2 to speed things up (Unsloth should have printed a warning about this).
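
Roughly, enabling it in a plain Transformers load looks like this (a sketch; it assumes the flash-attn package is installed, and the model name is a placeholder):

```python
# Sketch: loading a model with Flash Attention 2 in Transformers.
# Assumes flash-attn is installed; model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",                      # placeholder
    torch_dtype=torch.bfloat16,               # FA2 requires fp16/bf16
    attn_implementation="flash_attention_2",
)
```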

Yes, 4-bit will be slower since there are dequantization steps. Would you be able to share how you set up the original HF experiments? Did you use `prepare_model_for_kbit_training`?
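
For context, a typical HF QLoRA baseline with that prepare step looks roughly like this sketch (model name and LoRA settings are placeholders):

```python
# Sketch of a typical HF QLoRA baseline; the prepare step comes from peft.
# Model name and LoRA config below are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",  # placeholder
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)  # grad checkpointing etc.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=16,
                                         target_modules="all-linear"))
```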

danielhanchen avatar Oct 01 '24 08:10 danielhanchen