Results: 7 comments of elfisworking

I have written code for the Python SDK that adds a precision-control function. I think I could also do it for the Java SDK.

Hello, currently I am using int8 dynamic activations + int4 weights. The model is Llama-3-8B.
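For context, this configuration can be expressed with torchao's `quantize_` API. A minimal sketch, assuming a recent torchao release where `int8_dynamic_activation_int4_weight` is exported from `torchao.quantization` (the import path has moved between versions):

```
# Sketch: int8 dynamic activations + int4 weights via torchao.
# Import paths assume a recent torchao release and may differ by version.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16, device_map="cuda"
)

# Replace eligible nn.Linear layers with int8-dynamic-activation /
# int4-weight quantized equivalents, in place.
quantize_(model, int8_dynamic_activation_int4_weight())
```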

Thanks for your reply @andrewor14. The inference code I use for the original model is:

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_name = "/QAT/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
...
```
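The snippet above is cut off by the comment preview. A sketch of how such a timing script typically continues, using only standard transformers/PyTorch calls; the prompt and generation settings are illustrative placeholders, not the original code:

```
# Hypothetical continuation of the truncated script above.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)
model.eval()

prompt = "Tell me about large language models."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

# Report decode throughput for the newly generated tokens.
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```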

@andrewor14 thanks for your reply. I am sure that I used the `tune run quantize` command, but the model speed is still slow. What code can I provide to help you validate...
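For reference, torchtune's quantization recipe is documented as `tune run quantize --config quantization` (exact recipe and config names vary by version). One quick way to check that the resulting checkpoint actually contains quantized tensors is to inspect what was saved; the path and filename below are placeholders, not from the original thread:

```
# Hypothetical sanity check: list a few tensors from the quantized
# checkpoint to confirm their dtypes/shapes (path is a placeholder).
import torch

state_dict = torch.load(
    "/QAT/Meta-Llama-3-8B/quantized.pt", map_location="cpu"
)
for name, tensor in list(state_dict.items())[:10]:
    print(name, tensor.dtype, tuple(tensor.shape))
```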

Should I use PTQ first and then QAT, rather than directly applying QAT to the original model?
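For context, torchao's documented QAT flow applies `prepare` to the float model before fine-tuning and `convert` afterwards, with no PTQ pass first. A minimal sketch, assuming the prototype QAT API from around the time of this thread (the import path may have moved in later releases):

```
# Sketch of torchao's QAT prepare / fine-tune / convert flow.
# Import path is from the prototype namespace and may have moved since.
from transformers import AutoModelForCausalLM
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# 1. Insert fake-quantize ops so training observes quantization error.
model = qat_quantizer.prepare(model)

# ... fine-tune `model` here as usual ...

# 2. Swap the fake-quantize ops for actual quantized weights/kernels.
model = qat_quantizer.convert(model)
```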

I have implemented the quantization process you mentioned, and I am now sure that the current speed is normal on the A100. I will patiently wait for the optimized version...