elfisworking
I have written code for the Python SDK that adds a precision control function. I think maybe I can also do it for the Java SDK.
/assign elfisworking
Hello, currently I use int8 dynamic activations + int4 weights. The model is Llama3-8B.
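For reference, this is roughly how that scheme is applied with torchao's QAT quantizer (a minimal sketch; the model path and `groupsize` value are my assumptions, and the import path may differ across torchao versions):

```
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Assumed model path for illustration.
model = AutoModelForCausalLM.from_pretrained(
    "/QAT/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# int8 dynamic activations + int4 weights; groupsize=256 is an assumed example value.
qat_quantizer = Int8DynActInt4WeightQATQuantizer(groupsize=256)

# prepare() inserts fake-quantize ops so finetuning sees quantization error.
model = qat_quantizer.prepare(model)

# ... finetune the prepared model here ...

# convert() swaps the fake-quantized modules for truly quantized ones.
model = qat_quantizer.convert(model)
```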
Thanks for your reply @andrewor14. I use this inference code for the original model:

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_name = "/QAT/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
...
```
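(For anyone reproducing this: the script above is truncated, but a hypothetical completion might look like the following. The prompt and generation settings are my assumptions, not the original ones.)

```
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import time

model_name = "/QAT/Meta-Llama-3-8B"  # path taken from the truncated snippet
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = "What is quantization-aware training?"  # assumed prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time the generation and report throughput in tokens per second.
start = time.time()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/sec")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```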
@andrewor14 thanks for your reply. I am sure that I used the `tune run quantize` command, but the model speed is still slow. What code can I provide to help you validate...
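One thing that might help is a dump of the quantized checkpoint's tensor dtypes, to confirm the conversion actually took effect (a sketch; the checkpoint path is an assumption):

```
import torch

# Assumed checkpoint path; adjust to the actual output of `tune run quantize`.
state_dict = torch.load("/QAT/quantized_model.pt", map_location="cpu")

# Truly int4-quantized weights usually show up as packed integer tensors plus
# separate scale/zero-point tensors; seeing only bf16/fp16 weights would
# suggest the quantization step did not apply.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tensor.dtype, tuple(tensor.shape))
```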
Should I use PTQ first and then QAT, rather than directly applying QAT to the original model?
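For contrast, the PTQ path in torchao is a one-shot transform on the already-trained model, with no training loop involved, while QAT wraps finetuning between prepare() and convert(). A minimal PTQ sketch (API names from recent torchao; the model path and group size are assumptions):

```
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

# Assumed model path for illustration.
model = AutoModelForCausalLM.from_pretrained(
    "/QAT/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
)

# PTQ: quantize the trained weights in place, no further training involved.
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```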
I have implemented the quantization process you mentioned, and now I am sure that the current speed is normal on the A100. I will patiently wait for the optimized version....