David Xue

31 comments

Llama 2 introduced grouped-query attention (GQA), with which AutoGPTQ previously ran into the same error as this one (raising `ValueError: not enough values to unpack (expected 3, got 2)` when `inject_fused_attention=True`)...

Oh, I didn't realize #237 was there. It's also interesting that we have to disable fused attention when both exllama and act-order are enabled. Looks like this PR has been there for...

Is there a reason why AutoGPTQ's 8-bit mode is not recommended for 8-bit configurations? Is it a matter of inference speed/performance, or more a question of accuracy? But regardless,...

It's been so long, and I am still running into this issue as well. You must turn off `inject_fused_attention` manually when loading! This is affecting Llama 3.
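For anyone hitting the same unpack error, here is a minimal sketch of the manual workaround, assuming an already-quantized checkpoint at the hypothetical local path `./llama-3-gptq`:

```python
from auto_gptq import AutoGPTQForCausalLM

# Loading a GQA model (e.g. Llama 3) with fused attention injected raises
# ValueError: not enough values to unpack (expected 3, got 2),
# so disable the injection explicitly at load time.
model = AutoGPTQForCausalLM.from_quantized(
    "./llama-3-gptq",              # hypothetical path to a quantized checkpoint
    device="cuda:0",
    inject_fused_attention=False,  # the manual workaround described above
)
```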

I'm having the same issue with more than 100 concurrent requests at a time: running the OpenAI-compatible server and continuously hitting it with >100 concurrent requests leads to many...

I think the problem I was running into may be different from what you folks have. I figured mine out, and it's more of a client-side configuration problem with HTTP calls...
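The truncated comment doesn't name the client, so as an illustration only, here is a hedged sketch of the kind of client-side fix this usually means, using `httpx` (an assumption, not necessarily what was used). `httpx` defaults to a pool of 100 connections, so bursts of >100 concurrent requests contend for the pool and queue or time out unless the limits are raised:

```python
import asyncio
import httpx

async def main() -> None:
    # httpx defaults to max_connections=100; raise the pool limits and the
    # timeout to match the intended concurrency.
    limits = httpx.Limits(max_connections=256, max_keepalive_connections=64)
    timeout = httpx.Timeout(120.0)
    async with httpx.AsyncClient(limits=limits, timeout=timeout) as client:
        payload = {
            "model": "my-model",  # hypothetical model name
            "messages": [{"role": "user", "content": "hello"}],
        }
        # Fire 150 concurrent requests at the OpenAI-compatible endpoint.
        tasks = [
            client.post("http://localhost:8000/v1/chat/completions", json=payload)
            for _ in range(150)
        ]
        responses = await asyncio.gather(*tasks)
        print(sum(r.status_code == 200 for r in responses), "succeeded")

asyncio.run(main())
```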

@helpmefindaname Are you sure `config.json` is not possible? When I run something like `tagger = SequenceTagger.load("flair/ner-english-large")` I see this:
```
pytorch_model.bin: 100%|████████████████████| 2.24G/2.24G [02:27
```

Yes, let me upload it. I only tested at 8 bit; I can do some more testing at 4 bit as well and come back to this.
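For reference, the bit width is just a field on the quantization config, so re-running the same test at 4 bit is a one-line change; a minimal sketch, assuming the standard `auto_gptq` API (group size and act-order values here are placeholders):

```python
from auto_gptq import BaseQuantizeConfig

# Same recipe at both bit widths; only `bits` changes between the runs.
config_8bit = BaseQuantizeConfig(bits=8, group_size=128, desc_act=False)
config_4bit = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
```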

Some delays have been encountered due to https://github.com/AutoGPTQ/AutoGPTQ/issues/657. I am unable to get around the `nan` logits or gibberish output due to some issues with our library's integration...
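A quick way to confirm whether a given load is affected is to check a single forward pass for `nan` logits; a minimal sketch, assuming `model` and `tokenizer` are an already-loaded quantized model and its tokenizer:

```python
import torch

# `model` and `tokenizer` are assumed to be loaded already;
# the check itself is library-agnostic.
inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
if torch.isnan(logits).any():
    print("nan logits detected -- generation will produce gibberish")
```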

I am getting CUDA out-of-memory errors at the `.quantize()` step. It seems like `.quantize()` is forced to use only one GPU? Is there a way around it, or can...
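A hedged sketch of one thing worth trying first: `from_pretrained` accepts a `max_memory` map, which shards the full-precision weights across devices (with CPU as overflow) before `.quantize()` runs. This may or may not resolve the layer-by-layer behavior during quantization itself; the model id and memory budgets below are made up:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

# max_memory spreads the unquantized weights across devices instead of
# loading everything onto GPU 0; "cpu" acts as overflow.
model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # hypothetical model id
    quantize_config,
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "64GiB"},
)

# examples: a list of tokenized calibration samples
# (dicts with input_ids / attention_mask), elided here.
# model.quantize(examples)
```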