David Xue

31 comments

Llama 2 introduced grouped-query attention (GQA), with which AutoGPTQ previously ran into the same error as this one (raising `ValueError: not enough values to unpack (expected 3, got 2)` when `inject_fused_attention=True`)...

Oh, I didn't realize #237 was there. It's also interesting that we have to disable fused attention when both exllama and act-order are enabled. Looks like this PR has been there for...

Is there a reason why AutoGPTQ's 8-bit mode is not recommended for 8-bit configurations? Is it a matter of inference speed/performance, or more a question of accuracy? But regardless,...

It's been so long, and I am still running into this issue as well. You must turn off `inject_fused_attention` manually when loading! This is affecting Llama 3.
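For anyone hitting the same unpack error, here is a minimal sketch of the manual workaround, assuming an already-quantized checkpoint at the hypothetical local path `./llama-3-gptq`:

```python
from auto_gptq import AutoGPTQForCausalLM

# Loading a GQA model (e.g. Llama 3) with fused attention injected raises
# ValueError: not enough values to unpack (expected 3, got 2),
# so disable the injection explicitly at load time.
model = AutoGPTQForCausalLM.from_quantized(
    "./llama-3-gptq",              # hypothetical path to a quantized checkpoint
    device="cuda:0",
    inject_fused_attention=False,  # the manual workaround described above
)
```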

I'm having the same issue with more than 100 concurrent requests at a time: running the OpenAI-compatible server and continuously hitting it with >100 concurrent requests leads to many...

I think the problem I was running into may be different from what you folks have. I figured mine out, and it's more of a client-side configuration problem with HTTP calls...
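The truncated comment doesn't name the client, so as an illustration only, here is a hedged sketch of the kind of client-side fix this usually means, using `httpx` (an assumption, not necessarily what was used). `httpx` defaults to a pool of 100 connections, so bursts of >100 concurrent requests contend for the pool and queue or time out unless the limits are raised:

```python
import asyncio
import httpx

async def main() -> None:
    # httpx defaults to max_connections=100; raise the pool limits and the
    # timeout to match the intended concurrency.
    limits = httpx.Limits(max_connections=256, max_keepalive_connections=64)
    timeout = httpx.Timeout(120.0)
    async with httpx.AsyncClient(limits=limits, timeout=timeout) as client:
        payload = {
            "model": "my-model",  # hypothetical model name
            "messages": [{"role": "user", "content": "hello"}],
        }
        # Fire 150 concurrent requests at the OpenAI-compatible endpoint.
        tasks = [
            client.post("http://localhost:8000/v1/chat/completions", json=payload)
            for _ in range(150)
        ]
        responses = await asyncio.gather(*tasks)
        print(sum(r.status_code == 200 for r in responses), "succeeded")

asyncio.run(main())
```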

@helpmefindaname Are you sure `config.json` is not possible? When I run something like `tagger = SequenceTagger.load("flair/ner-english-large")` I see this:
```
pytorch_model.bin: 100%|████████████████████| 2.24G/2.24G [02:27
```

Yes, let me upload it. I only tested at 8 bit; I can do some more testing at 4 bit as well and come back to this.
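For reference, the bit width is just a field on the quantization config, so re-running the same test at 4 bit is a one-line change; a minimal sketch, assuming the standard `auto_gptq` API (group size and act-order values here are placeholders):

```python
from auto_gptq import BaseQuantizeConfig

# Same recipe at both bit widths; only `bits` changes between the runs.
config_8bit = BaseQuantizeConfig(bits=8, group_size=128, desc_act=False)
config_4bit = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
```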

Some delays have been encountered due to https://github.com/AutoGPTQ/AutoGPTQ/issues/657. I am unable to get around the `nan` logits or gibberish output due to some issues with our library's integration...
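A quick way to confirm whether a given load is affected is to check a single forward pass for `nan` logits; a minimal sketch, assuming `model` and `tokenizer` are an already-loaded quantized model and its tokenizer:

```python
import torch

# `model` and `tokenizer` are assumed to be loaded already;
# the check itself is library-agnostic.
inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
if torch.isnan(logits).any():
    print("nan logits detected -- generation will produce gibberish")
```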

I am getting CUDA out-of-memory errors at the `.quantize()` step. It seems like `.quantize()` is forced to use only one GPU? Is there a way around it, or can...
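A hedged sketch of one thing worth trying first: `from_pretrained` accepts a `max_memory` map, which shards the full-precision weights across devices (with CPU as overflow) before `.quantize()` runs. This may or may not resolve the layer-by-layer behavior during quantization itself; the model id and memory budgets below are made up:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

# max_memory spreads the unquantized weights across devices instead of
# loading everything onto GPU 0; "cpu" acts as overflow.
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # hypothetical model id
    quantize_config,
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "64GiB"},
)

# examples: a list of tokenized calibration samples
# (dicts with input_ids / attention_mask), elided here.
# model.quantize(examples)
```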