elfisworking
I have written code for the Python SDK that adds a precision control function. I think maybe I can also do it for the Java SDK.
/assign elfisworking
Hello, currently I use int8 dynamic activations + int4 weights. The model is Llama3-8B.
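For reference, this is roughly how that scheme is applied with torchao's QAT quantizer (a minimal sketch; the model path and `groupsize` value are my assumptions, and the import path may differ across torchao versions):

```
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Assumed model path for illustration.
model = AutoModelForCausalLM.from_pretrained(
    "/QAT/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# int8 dynamic activations + int4 weights; groupsize=256 is an assumed example value.
qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=256)

# prepare() inserts fake-quantize ops so finetuning sees quantization error.
model = qat_quantizer.prepare(model)

# ... finetune the prepared model here ...

# convert() swaps the fake-quantized modules for truly quantized ones.
model = qat_quantizer.convert(model)
```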
Thanks for your reply @andrewor14. I use this inference code for the original model:

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_name = "/QAT/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
...
```
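(For anyone reproducing this: the script above is truncated, but a hypothetical completion might look like the following. The prompt and generation settings are my assumptions, not the original ones.)

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_name = "/QAT/Meta-Llama-3-8B"  # path taken from the truncated snippet
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = "What is quantization-aware training?"  # assumed prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time the generation and report throughput in tokens per second.
start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/sec")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```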
@andrewor14 thanks for your reply. I am sure that I used the `tune run quantize` command, but the model speed is still slow. What code can I provide to help you validate...
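One thing that might help is a dump of the quantized checkpoint's tensor dtypes, to confirm the conversion actually took effect (a sketch; the checkpoint path is an assumption):

```
import torch

# Assumed checkpoint path; adjust to the actual output of `tune run quantize`.
state_dict = torch.load("/QAT/quantized_model.pt", map_location="cpu")

# Truly int4-quantized weights usually show up as packed integer tensors plus
# separate scale/zero-point tensors; seeing only bf16/fp16 weights would
# suggest the quantization step did not apply.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tensor.dtype, tuple(tensor.shape))
```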
Should I use PTQ first and then QAT, rather than directly applying QAT to the original model?
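For contrast, the PTQ path in torchao is a one-shot transform on the already-trained model, with no training loop involved, while QAT wraps finetuning between prepare() and convert(). A minimal PTQ sketch (API names from recent torchao; the model path and group size are assumptions):

```
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

# Assumed model path for illustration.
model = AutoModelForCausalLM.from_pretrained(
    "/QAT/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# PTQ: quantize the trained weights in place, no further training involved.
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```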
I have implemented the quantization process you mentioned, and now I am sure that the current speed is normal on the A100. I will patiently wait for the optimized version....