Corey Wilder
Also having this issue with Llama-2-7b on H100
I am encountering this issue with AutoGPTQ and Mixtral as well, and I'm seeing a similar error with AutoAWQ and Mixtral: `ValueError: OC is not multiple of cta_N = 64`
It seems like if you use AutoGPTQ/AutoAWQ directly you can get something working:

```python
from auto_gptq import AutoGPTQForCausalLM
from awq import AutoAWQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0")  # GPTQ model
model = AutoAWQForCausalLM.from_quantized(model_path)                    # AWQ model
```
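In case it helps, here's a minimal usage sketch, under the assumption that the quantized model behaves like a regular `transformers` model (`model_path` is the same placeholder as above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")

# Generate with the quantized model loaded above
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```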
Very eagerly awaiting this merge. Thanks to everyone involved!
Seeing this issue on H100s with 0.6.3.post1, using the offline inference `LLM` class rather than an endpoint. The issue arises when `prefix_caching=True` and `prompt_logprobs` are requested, as others have mentioned.
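For reference, roughly how I'm hitting it (a minimal sketch of the offline `LLM` path; the model name is just an example, and I'm assuming `enable_prefix_caching` is the flag meant by `prefix_caching=True` above):

```python
from vllm import LLM, SamplingParams

# Offline engine with prefix caching enabled
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

# Requesting prompt logprobs alongside generation is what triggers the error
params = SamplingParams(max_tokens=16, prompt_logprobs=1)
outputs = llm.generate(["The quick brown fox"], params)
```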