lit-llama
Fix adapter v2 llm.int8 inference
Converts the Linear8bitLt.weight from int8 back to the dtype of the input and the adapter.
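For reference, a minimal sketch of what the conversion could look like, assuming bitsandbytes' LLM.int8 layout where the quantized int8 matrix lives in `weight.CB` with per-row absmax scales in `weight.SCB` (the names and scaling follow bitsandbytes internals; the actual change in this PR may differ in detail):

```python
import torch

def dequantize_8bit_weight(linear, target_dtype: torch.dtype) -> torch.Tensor:
    # Assumes the Linear8bitLt layer has already been quantized, so that
    # W_fp ~= CB * SCB / 127, with CB the int8 matrix and SCB the per-row
    # absmax scales (bitsandbytes' LLM.int8 layout).
    cb = linear.weight.CB.float()
    scb = linear.weight.SCB.float()
    # Rescale each row by its absmax, then cast to the dtype of the
    # incoming activations / adapter parameters.
    return (cb * scb.unsqueeze(1) / 127.0).to(target_dtype)
```

The dequantized weight can then be combined with the adapter scale and bias in the same dtype as the input.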
Looks awesome! Thanks for the PR!
I just tried it out and it seems to work without technical issues (and it roughly halves the memory usage).
The only thing is that the generated text from the quantized model didn't look great:
Time to load model: 18.81 seconds.
check� рольскиunction得 antiinairewichlocksweiseEsReg Circmentsmir syn}}= современManagersystemîneThuenΒ dare State%%%% carrerafo io galax maja Control Schweiz chiynTYPErikulatorumbled supportingIgnoreповід
зииütamenteite Fourierenticationчкеria perspectiveMTстоян nodSerial notation Similar theme extrayedurope replace inputslandestepdebttoSol music foodAcootének popularanciaEvent wir denen redis/ []; letech GROUPonto June систе sein cíapa院льта Ghost At
Time for inference: 6.28 sec total, 15.91 tokens/sec
Memory used: 7.83 GB
Compared to non-quantized:
Loading model ...
Time to load model: 18.93 seconds.
Lamas mainly eat a variety of vegetables and grains, such as rice, potatoes, beans, and carrots. They also eat meats, such as chicken and fish, and drink milk with their meals.
Time for inference: 2.60 sec total, 19.64 tokens/sec
Memory used: 13.55 GB
But I think that's a separate issue with respect to how the model is finetuned. What do you think, @awaelchli @lantiga @carmocca? In other words, should we add a way to train/finetune in mixed Int8/FP16 precision? (Again, maybe a separate issue/PR?)
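For reference, a rough sketch of what mixed Int8/FP16 finetuning could look like: freeze the int8-quantized base weights and train only the adapter parameters in fp16. The `"adapter"` name filter below is an assumption for illustration, not the actual lit-llama API:

```python
import torch

def mark_adapter_params_trainable_fp16(model: torch.nn.Module) -> None:
    # Sketch only: the "adapter" substring is an assumed naming convention
    # for the adapter parameters; the (int8-quantized) base weights stay frozen.
    for name, param in model.named_parameters():
        if "adapter" in name:
            param.data = param.data.to(torch.float16)
            param.requires_grad = True
        else:
            param.requires_grad = False
```

The forward pass would then dequantize the int8 base weights on the fly (as this PR does for inference), so gradients only flow through the fp16 adapter parameters.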
Oh if that's the case then it's related, the un-quantizing needs to match. I think I have an idea, I'll update my PR in a bit.
Awesome! There were a few minor things with the cache and the generate function call in the generate/adapter_v2.py script, but it seems to work for me now!
Besides updating the finetune/adapter_v2.py script with the dtype, it seems good to go. Let me know if you want to do the fix or if I should take care of it. Happy to help.
Great! @Diormiu we'll get this merged as soon as the fix gets in. If you don't have time, we can push this through, no problem.
Hey, I just needed to use the adapter_v2 script for something else and thought I'd just go ahead and implement the fix. I hope you don't mind, @Diormiu. And big thanks again for the PR!!
PS: @lantiga once the tests pass this should be good to merge