Corey Wilder

Results 5 comments of Corey Wilder

I am encountering this issue with AutoGPTQ and Mixtral as well. I am seeing a similar error with AutoAWQ and Mixtral: `ValueError: OC is not multiple of cta_N = 64`.

It seems that if you use AutoGPTQ/AutoAWQ directly, you can get something working:

```python
from auto_gptq import AutoGPTQForCausalLM
from awq import AutoAWQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0")
model = AutoAWQForCausalLM.from_quantized(model_path)
```

Very eagerly awaiting this merge. Thanks to everyone involved!

Seeing this issue on H100s with 0.6.3.post1, using the offline-inference `LLM` class rather than an endpoint. The issue arises when `prefix_caching=True` is set and `prompt_logprobs` are requested, as others have mentioned.
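For reference, a minimal sketch of the kind of call that hits this for me (the model name is a placeholder, and I'm using `enable_prefix_caching`, the `LLM` keyword for the prefix-caching setting mentioned above; requires a GPU to actually run):

```python
from vllm import LLM, SamplingParams

# Offline inference with prefix caching enabled (placeholder model)
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    enable_prefix_caching=True,
)

# Requesting prompt logprobs alongside prefix caching is the
# combination that appears to trigger the error
params = SamplingParams(max_tokens=16, prompt_logprobs=1)
outputs = llm.generate(["Hello, world"], params)
```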