
Fix adapter v2 llm.int8 inference

Open diormiu opened this issue 2 years ago • 5 comments

Converts the Linear8bitLt.weight from int8 back to the dtype of the input and adapter tensors.
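For context, the de-quantization amounts to undoing the row-wise absmax quantization that LLM.int8 applies. A minimal sketch, assuming bitsandbytes' Linear8bitLt layout (the int8 matrix in weight.CB, per-row absmax scales in weight.SCB); the dequantize_weight helper name is hypothetical:

```python
import torch

def dequantize_weight(layer: torch.nn.Module, dtype: torch.dtype) -> torch.Tensor:
    # Hypothetical helper: recover a floating-point view of an int8
    # Linear8bitLt weight so it matches the input/adapter dtype.
    CB = layer.weight.CB    # int8 quantized matrix, shape (out_features, in_features)
    SCB = layer.weight.SCB  # per-row absmax scales, shape (out_features,)
    # Absmax quantization maps x -> round(x * 127 / absmax), so the
    # inverse is int8 * absmax / 127, applied row-wise.
    return (CB.float() * SCB.unsqueeze(1) / 127.0).to(dtype)
```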

diormiu avatar May 24 '23 23:05 diormiu

Looks awesome! Thanks for the PR!

I just tried it out, and it seems to work without technical issues (it also cuts RAM usage roughly in half).

The only thing is that the generated text from the quantized model didn't look great:

Time to load model: 18.81 seconds.
check� рольскиunction得 antiinairewichlocksweiseEsReg Circmentsmir syn}}= современManagersystemîneThuenΒ dare State%%%% carrerafo io galax maja Control Schweiz chiynTYPErikulatorumbled supportingIgnoreповід
зииütamenteite Fourierenticationчкеria perspectiveMTстоян nodSerial notation Similar theme extrayedurope replace inputslandestepdebttoSol music foodAcootének popularanciaEvent wir denen redis/ []; letech GROUPonto June систе sein cíapa院льта Ghost At


Time for inference: 6.28 sec total, 15.91 tokens/sec
Memory used: 7.83 GB

Compared to non-quantized:

Loading model ...
Time to load model: 18.93 seconds.
Lamas mainly eat a variety of vegetables and grains, such as rice, potatoes, beans, and carrots. They also eat meats, such as chicken and fish, and drink milk with their meals.


Time for inference: 2.60 sec total, 19.64 tokens/sec
Memory used: 13.55 GB

But I think that's a separate issue related to how the model is finetuned. What do you think, @awaelchli @lantiga @carmocca? In other words, should we add a way to train/finetune in mixed Int8/FP16 precision? (Again, maybe as a separate issue/PR?)
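As a rough illustration of the idea (a sketch only, not lit-llama's actual finetuning code; matching adapter parameters by the substring "adapter" is an assumption about how they are named):

```python
import torch

def mark_trainable(model: torch.nn.Module) -> None:
    # Mixed Int8/FP16 finetuning sketch: keep the (quantized) base
    # weights frozen and train only the fp16 adapter parameters, so the
    # adapters learn on top of the same precision used at inference.
    for name, param in model.named_parameters():
        if "adapter" in name:
            # Trainable adapter parameters in fp16
            param.data = param.data.to(torch.float16)
            param.requires_grad = True
        else:
            # Frozen (potentially int8-quantized) base weights
            param.requires_grad = False
```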

rasbt avatar May 25 '23 02:05 rasbt

Oh, if that's the case then it's related: the de-quantization needs to match. I think I have an idea; I'll update my PR in a bit.

diormiu avatar May 25 '23 03:05 diormiu

Awesome! There were a few minor things with the cache and the generate function call in the generate/adapter_v2.py script, but it seems to work for me now!

Besides updating the finetune/adapter_v2.py script with the dtype, it seems good to go. Let me know if you want to do the fix or if I should take care of it. Happy to help.
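For reference, the dtype alignment could look roughly like this in the adapter_v2 forward pass (a sketch under assumptions: adapter_scale and adapter_bias follow lit-llama's adapter_v2 naming, but the exact wrapper in the repo may differ):

```python
import torch

def adapter_v2_forward(self, x: torch.Tensor) -> torch.Tensor:
    # Run the (possibly int8-quantized) base linear layer, then cast its
    # output to the adapter parameters' dtype before applying the
    # learned per-channel scale and bias.
    y = self.linear(x).to(self.adapter_scale.dtype)
    return self.adapter_scale * y + self.adapter_bias
```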

rasbt avatar May 26 '23 22:05 rasbt

Great! @Diormiu, we'll get this merged as soon as the fix gets in. If you don't have time, we can push this through, no problem.

lantiga avatar May 29 '23 17:05 lantiga

Hey, I just needed to use the adapter_v2 script for something else and thought I'd go ahead and implement the fix. I hope you don't mind, @Diormiu. And big thanks again for the PR!

PS: @lantiga, once the tests pass this should be good to merge.

rasbt avatar Jun 02 '23 18:06 rasbt