Corey Wilder
Also having this issue with Llama-2-7b on H100
I am encountering this issue with AutoGPTQ and Mixtral as well, and I'm seeing a similar error with AutoAWQ and Mixtral: `ValueError: OC is not multiple of cta_N = 64`
It seems like if you use AutoGPTQ/AutoAWQ directly you can get something working:

```python
from auto_gptq import AutoGPTQForCausalLM
from awq import AutoAWQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(model_path, device="cuda:0")  # GPTQ model
model = AutoAWQForCausalLM.from_quantized(model_path)                    # AWQ model
```
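In case it helps, here's a minimal usage sketch, under the assumption that the quantized model behaves like a regular `transformers` model (`model_path` is the same placeholder as above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")

# Generate with the quantized model loaded above
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```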
Very eagerly awaiting this merge. Thanks to everyone involved!
Seeing this issue on H100s with 0.6.3.post1, using the offline inference `LLM` class rather than an endpoint. The issue arises when `prefix_caching=True` and `prompt_logprobs` are requested, as others have mentioned.
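For reference, roughly how I'm hitting it (a minimal sketch of the offline `LLM` path; the model name is just an example, and I'm assuming `enable_prefix_caching` is the flag meant by `prefix_caching=True` above):

```python
from vllm import LLM, SamplingParams

# Offline engine with prefix caching enabled
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

# Requesting prompt logprobs alongside generation is what triggers the error
params = SamplingParams(max_tokens=16, prompt_logprobs=1)
outputs = llm.generate(["The quick brown fox"], params)
```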