Gavin Li
The Mac version doesn't support Qwen yet; it only supports the Llama/Llama 2 series of models. We'll add support later.
Can you provide a stack trace?
We can do it. Will add it.
Can you please provide the whole source code file?
I'll try, but my understanding is that the bottleneck is not there. The current bottleneck is loading the model from disk into GPU memory; batching more layers most likely won't help.
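To see why per-layer disk loading dominates, here's a rough back-of-envelope sketch. The layer size and disk throughput numbers below are my own assumptions for illustration, not measured values from AirLLM:

```python
# Rough estimate of per-layer load time when streaming layers from disk.
# Both numbers are assumptions: ~0.9 GB per fp16 layer of a 70B-class
# model, and ~2 GB/s sustained NVMe read throughput.
layer_bytes = 0.9e9   # assumed bytes per layer (fp16)
disk_bps    = 2.0e9   # assumed disk read throughput, bytes/second

load_secs = layer_bytes / disk_bps
print(f"per-layer load ≈ {load_secs * 1e3:.0f} ms")  # → per-layer load ≈ 450 ms
```

Hundreds of milliseconds of I/O per layer dwarfs the few milliseconds of GPU compute on that layer, which is why batching more layers onto the GPU doesn't move the needle.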
> torch.cuda.synchronize()

Great job. Yes, I'll fix the profiling and look into a few possible improvements.
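For reference, the pitfall the quote points at: CUDA kernels launch asynchronously, so a wall-clock timing without `torch.cuda.synchronize()` measures only the kernel launch, not its execution. A minimal sketch of the corrected pattern (the `timed` helper is my own illustration, not an AirLLM API):

```python
import time
import torch

def timed(fn, *args):
    """Time fn, synchronizing the GPU before and after so the
    measurement covers actual kernel execution, not just launch."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(1024, 1024, device=device)
_, secs = timed(torch.matmul, a, a)
print(f"matmul took {secs * 1e3:.2f} ms")
```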
> @lyogavin I tried this out today. I have a suggestion here. What I noticed is the GPU is not utilized fully in this case. For example:
>
> `Torch.compile allows us to capture a larger region into a single compiled region, and...`
Can you provide more info? Which Hugging Face model repo ID are you using? Also, can you check whether you have enough disk space?
OK... It's a LoRA model... We'll look into how to support this. Thanks.