
Add q-cache 6 and 8 support for Exllamav2

Open · randoentity opened this issue 1 year ago


randoentity · Jul 27 '24 16:07

@oobabooga could this be merged into main? It would be useful for models that can become unstable with Q4 cache quantization (such as Qwen or Mistral Nemo, as some people have reported). Also, the current 8-bit cache implementation seems to be old, and the author of exllamav2 says that Q8 is better (even Q4 can be better while taking up less space).

GodEmperor785 · Sep 19 '24 15:09

Some of the names and references have changed, but otherwise this works.

ZedOud · Oct 25 '24 06:10

@ZedOud thanks for testing. Can you propose changes for the names and references so I can remain lazy?

randoentity · Nov 10 '24 04:11

shared.gradio['cache_4bit'] = gr.Checkbox(label="cache_4bit", value=shared.args.cache_8bit, info='Use 4-bit (FP4) cache to save VRAM.')

Are you sure this is correct: value=shared.args.cache_8bit?
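
Presumably the intended value is the matching 4-bit flag rather than the 8-bit one. A minimal corrected sketch, assuming shared.args.cache_4bit is the CLI counterpart of this checkbox (as the label suggests):

shared.gradio['cache_4bit'] = gr.Checkbox(
    label="cache_4bit",
    value=shared.args.cache_4bit,  # presumed fix: track the 4-bit flag, not cache_8bit
    info='Use 4-bit (FP4) cache to save VRAM.'
)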

Originalimoc · Nov 14 '24 07:11

FYI: Qwen2.5 7B (FP16 and 8.0bpw) with --cache_4bit just outputs garbage, while the full FP16 cache is fine (Edit: I also tried Q6 today, which is also miles better than Q4):

You're right; I apologize for missing that step.

So, to get the the0.00000766531795656 for the input/token outputтокen and $0.0.38155679 for the outputтокenn.

Thas give $7.66 for the input/токen and $8.65 for the outputтокen million.
system
1,000,000,00 * 0,00000002531795656 = 2,666.285566653388 \n 1,000,000,00 * 0,0000008657957956 = 7,657.57839935568

system
1,000,000,000 * 0,00000002531795666 = 2,5333899938757998 \n 1,000,,,000,00 * 0,0000057958332054 = 7,6568326602265769

system
2,653,333,333 \n 7,65,832,266

71828394
971556
system
2,000,000,000 * 0,000007565317956 = 1,155755

...or...

Apologies for the
-END OF-sn

_solution6.686_7888888864
actuallySystem
Instructionsionary\tStep	StepB	StepStepStepStepStepStepStepStpeStepStepStepStpeStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStpepepSteppoepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepe

Fine (with the full FP16 cache):

You're right. I apologize for the oversight. Let's multiply the prices by 1,000,000.

The price per million tokens for input/prompt is:
0.00000253846153846 × 1,000,000 = 2.53846153846

The price per million tokens for completion is:
0.00001 × 1,000,000 = 10

So, the final answers are:

The price for the input/prompt tokens is 2.538 million.
The price for the completion tokens is 10 million.
Therefore, the total price for 1 million tokens is 2.538 and 10.

...or...

You're right. I apologize for the oversight. Let's multiply the values by 1,000,000:

x = 0.00000253846153846 * 1,000,000 = 2.53846153846

y = 0.00001 * 1,000,000 = 10

So, the cost for 1 million tokens is:

Input/prompt: $2.53846153846 ≈ $2.54

Completion: $10

The final answer is:
The cost for 1 million tokens is $2.54 for the input/prompt and $10 for the completion.

If you have any more questions or need further clarification, feel free to ask!

Which is good enough for a 7B.
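
For anyone who wants to reproduce this comparison outside the webui, here is a minimal sketch using the exllamav2 API directly (the model path, prompt, and generation settings are placeholders, not taken from this thread, and it assumes a recent exllamav2 with flash-attn available for the dynamic generator):

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer, ExLlamaV2Cache_Q4
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/path/to/Qwen2.5-7B-exl2")  # placeholder model dir
model = ExLlamaV2(config)

# Swap in ExLlamaV2Cache_Q6, ExLlamaV2Cache_Q8, or ExLlamaV2Cache (FP16) to compare
cache = ExLlamaV2Cache_Q4(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Multiply 0.00000253846153846 by 1,000,000.", max_new_tokens=128))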

Originalimoc · Nov 14 '24 07:11

@randoentity

I will include a bit of extra context so you can see how some of it changed.

exllamav2.py

# line 60
        # Determine the correct cache type
        if shared.args.cache_8bit:
            cache_type = ExLlamaV2Cache_8bit
        elif shared.args.cache_q4:
            cache_type = ExLlamaV2Cache_Q4
        elif shared.args.cache_q6:
            cache_type = ExLlamaV2Cache_Q6
        elif shared.args.cache_q8:
            cache_type = ExLlamaV2Cache_Q8
        else:
            cache_type = ExLlamaV2Cache

        # Use TP if specified
        if shared.args.enable_tp:
            cache = ExLlamaV2Cache_TP(model, base=cache_type)
        else:
            cache = cache_type(model, lazy=shared.args.autosplit)
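
These cache classes all come from the exllamav2 package. For completeness, a plausible import block for the top of the file (assuming an exllamav2 version that exports the quantized cache variants and the TP cache at the top level, as recent releases do):

from exllamav2 import (
    ExLlamaV2Cache,
    ExLlamaV2Cache_8bit,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Cache_Q6,
    ExLlamaV2Cache_Q8,
    ExLlamaV2Cache_TP,
)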

Only the first block changed; everything following the "Use TP if specified" comment is the same.

exllamav2_hf.py

# line 48
        # Determine the correct cache type
        if shared.args.cache_8bit:
            cache_type = ExLlamaV2Cache_8bit
        elif shared.args.cache_q4:
            cache_type = ExLlamaV2Cache_Q4
        elif shared.args.cache_q6:
            cache_type = ExLlamaV2Cache_Q6
        elif shared.args.cache_q8:
            cache_type = ExLlamaV2Cache_Q8
        else:
            cache_type = ExLlamaV2Cache

        # Use TP if specified

I think these are the only things that need to be changed from your commit.
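
For the new options to exist at all, the matching flags also have to be declared wherever the webui builds its argument parser (modules/shared.py). A hypothetical self-contained sketch, with flag names inferred from the shared.args.cache_q* attributes used above:

import argparse

parser = argparse.ArgumentParser()
group = parser.add_argument_group('ExLlamaV2')
# Assumed flag names mirroring the attributes referenced in the diff
group.add_argument('--cache_q4', action='store_true', help='Use Q4 cache quantization to save VRAM.')
group.add_argument('--cache_q6', action='store_true', help='Use Q6 cache quantization to save VRAM.')
group.add_argument('--cache_q8', action='store_true', help='Use Q8 cache quantization to save VRAM.')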

ZedOud · Nov 15 '24 21:11

I'm inclined to go with https://github.com/oobabooga/text-generation-webui/pull/6561 rather than this PR. Reviews there are welcome.

oobabooga · Dec 09 '24 15:12

Closing this one in favor of https://github.com/oobabooga/text-generation-webui/pull/6561. Thanks for the PR @randoentity and sorry for not having reviewed earlier!

oobabooga · Dec 17 '24 20:12