
Unexpected latency times

Open davidjimenezphd opened this issue 1 year ago • 1 comment

Hi. I'm an early adopter of unsloth, and my recent experiments with the library produced unexpected latency results. I followed the official notebooks and got the following results while fine-tuning both Gemma 2 and Qwen 2.5:

|         | `load_in_4bit = True`  | `load_in_4bit = False` |
|---------|------------------------|------------------------|
| unsloth | 1.6 it/s, 6 GB VRAM    | 1.8 it/s, 6.5 GB VRAM  |
| HF      | 1.2 it/s, 17.8 GB VRAM | 1.4 it/s, 18.6 GB VRAM |
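
For reference, both unsloth runs followed the official notebook pattern; a minimal sketch is below (the model name and LoRA hyperparameters are placeholders, not my exact settings):

```python
# Minimal sketch of the unsloth setup (placeholder model and hyperparameters).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",  # placeholder model name
    max_seq_length=2048,
    load_in_4bit=True,  # toggled True/False for the two columns above
)

# Attach LoRA adapters as in the notebooks.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)
```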

While the VRAM reduction is great compared to HuggingFace, I observed a couple of unexpected behaviors:

  1. I was expecting roughly a 2x speedup (i.e., about half the per-iteration latency) for these models, but the table shows only about 1.3x over HF.
  2. Even more surprisingly, the 4-bit quantized model ran slower (fewer it/s) than the model at native precision.

Could anyone shed some light on this?

BTW: I'm using an A100 80GB GPU.

Thanks in advance

davidjimenezphd avatar Sep 27 '24 11:09 davidjimenezphd

@davidjimenezphd Apologies for the delay! Our benchmarks are at https://huggingface.co/blog/unsloth-trl which might be helpful.

For Gemma 2, you should enable Flash Attention 2 to speed things up (Unsloth should have printed a warning about this).
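
Roughly, enabling it in a plain Transformers load looks like this (a sketch; it assumes the flash-attn package is installed, and the model name is a placeholder):

```python
# Sketch: loading a model with Flash Attention 2 in Transformers.
# Assumes flash-attn is installed; model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",                      # placeholder
    torch_dtype=torch.bfloat16,               # FA2 requires fp16/bf16
    attn_implementation="flash_attention_2",
)
```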

Yes, 4-bit will be slower since there are dequantization steps. Would you be able to share how you set up the original HF experiments? Did you use `prepare_model_for_kbit_training`?
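
For context, a typical HF QLoRA baseline with that prepare step looks roughly like this sketch (model name and LoRA settings are placeholders):

```python
# Sketch of a typical HF QLoRA baseline; the prepare step comes from peft.
# Model name and LoRA config below are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",  # placeholder
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)  # grad checkpointing etc.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=16,
                                         target_modules="all-linear"))
```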

danielhanchen avatar Oct 01 '24 08:10 danielhanchen