text-generation-webui
Add q-cache 6 and 8 support for Exllamav2
@oobabooga could this be merged to main? It would be useful for models that can become unstable with Q4 cache quantization (like Qwen or Mistral Nemo, as reported by some people). Also, the current 8-bit cache implementation seems to be old, and the author of exllamav2 says that Q8 is better (even Q4 can be better while taking less space).
Some of the names and references have changed, but otherwise this works.
@ZedOud thanks for testing. Can you propose changes for the names and references so I can remain lazy?
shared.gradio['cache_4bit'] = gr.Checkbox(label="cache_4bit", value=shared.args.cache_8bit, info='Use 4-bit (FP4) cache to save VRAM.')
Are you sure this is correct: value=shared.args.cache_8bit?
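Presumably the checkbox default should track the matching attribute, i.e. the line would read:

shared.gradio['cache_4bit'] = gr.Checkbox(label="cache_4bit", value=shared.args.cache_4bit, info='Use 4-bit (FP4) cache to save VRAM.')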
FYI: a 7B FP16 & 8.0bpw Qwen2.5 with --cache_4bit just outputs garbage, while the FP16 full cache is fine (Edit: I also tried Q6 today, which is also miles better than Q4):
You're right; I apologize for missing that step.
So, to get the the0.00000766531795656 for the input/token outputтокen and $0.0.38155679 for the outputтокenn.
Thas give $7.66 for the input/токen and $8.65 for the outputтокen million.
system
1,000,000,00 * 0,00000002531795656 = 2,666.285566653388 \n 1,000,000,00 * 0,0000008657957956 = 7,657.57839935568
system
1,000,000,000 * 0,00000002531795666 = 2,5333899938757998 \n 1,000,,,000,00 * 0,0000057958332054 = 7,6568326602265769
system
2,653,333,333 \n 7,65,832,266
71828394
971556
system
2,000,000,000 * 0,000007565317956 = 1,155755
...or...
Apologies for the
-END OF-sn
_solution6.686_7888888864
actuallySystem
Instructionsionary\tStep StepB StepStepStepStepStepStepStepStpeStepStepStepStpeStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStepStpepepSteppoepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepepe
With FP16 full cache, the output is fine:
You're right. I apologize for the oversight. Let's multiply the prices by 1,000,000.
The price per million tokens for input/prompt is:
0.00000253846153846 × 1,000,000 = 2.53846153846
The price per million tokens for completion is:
0.00001 × 1,000,000 = 10
So, the final answers are:
The price for the input/prompt tokens is 2.538 million.
The price for the completion tokens is 10 million.
Therefore, the total price for 1 million tokens is 2.538 and 10.
...or...
You're right. I apologize for the oversight. Let's multiply the values by 1,000,000:
x = 0.00000253846153846 * 1,000,000 = 2.53846153846
y = 0.00001 * 1,000,000 = 10
So, the cost for 1 million tokens is:
Input/prompt: $2.53846153846 ≈ $2.54
Completion: $10
The final answer is:
The cost for 1 million tokens is $2.54 for the input/prompt and $10 for the completion.
If you have any more questions or need further clarification, feel free to ask!
Which is good enough for a 7B.
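For anyone who wants to reproduce the Q4-vs-Q6 comparison outside the webui, here is a minimal sketch using the exllamav2 API directly. The cache class names match the snippets later in this thread; the model path and prompt are placeholders, and a recent exllamav2 release with the dynamic generator is assumed:

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
    ExLlamaV2Cache,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Cache_Q6,
)
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "/path/to/Qwen2.5-7B-exl2-8.0bpw"  # placeholder path
prompt = "Multiply 0.00000253846153846 by 1,000,000."

# Load the model once per cache type and compare the outputs side by side
for cache_cls in (ExLlamaV2Cache, ExLlamaV2Cache_Q4, ExLlamaV2Cache_Q6):
    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)
    cache = cache_cls(model, lazy=True)  # lazy=True defers allocation for autosplit
    model.load_autosplit(cache, progress=True)
    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
    print(f"--- {cache_cls.__name__} ---")
    print(generator.generate(prompt=prompt, max_new_tokens=200))
    model.unload()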
@randoentity
I will include a bit of extra context so you can see how some of it changed.
exllamav2.py
# line 60
# Determine the correct cache type
if shared.args.cache_8bit:
    cache_type = ExLlamaV2Cache_8bit
elif shared.args.cache_q4:
    cache_type = ExLlamaV2Cache_Q4
elif shared.args.cache_q6:
    cache_type = ExLlamaV2Cache_Q6
elif shared.args.cache_q8:
    cache_type = ExLlamaV2Cache_Q8
else:
    cache_type = ExLlamaV2Cache

# Use TP if specified
if shared.args.enable_tp:
    cache = ExLlamaV2Cache_TP(model, base=cache_type)
else:
    cache = cache_type(model, lazy=shared.args.autosplit)
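For context, both snippets assume the quantized cache classes are imported at the top of each file. In recent exllamav2 releases they are exported from the package root, so the import would presumably look like:

from exllamav2 import (
    ExLlamaV2Cache,
    ExLlamaV2Cache_8bit,
    ExLlamaV2Cache_Q4,
    ExLlamaV2Cache_Q6,
    ExLlamaV2Cache_Q8,
    ExLlamaV2Cache_TP,
)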
Only the first block changed; everything following the "Use TP if specified" comment is the same.

exllamav2_hf.py
# line 48
# Determine the correct cache type
if shared.args.cache_8bit:
    cache_type = ExLlamaV2Cache_8bit
elif shared.args.cache_q4:
    cache_type = ExLlamaV2Cache_Q4
elif shared.args.cache_q6:
    cache_type = ExLlamaV2Cache_Q6
elif shared.args.cache_q8:
    cache_type = ExLlamaV2Cache_Q8
else:
    cache_type = ExLlamaV2Cache

# Use TP if specified
I think these are the only things that need to be changed from your commit.
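One piece not shown above: the new shared.args attributes need matching command-line flags. A minimal sketch of what the modules/shared.py additions might look like, mirroring the existing --cache_8bit flag (help text and placement are assumptions inferred from the shared.args references above, not taken from the actual commit):

# Hypothetical additions to modules/shared.py; flag names inferred from
# the shared.args references in the snippets above.
parser.add_argument('--cache_q4', action='store_true', help='Use Q4 (4-bit) KV cache to save VRAM.')
parser.add_argument('--cache_q6', action='store_true', help='Use Q6 (6-bit) KV cache to save VRAM.')
parser.add_argument('--cache_q8', action='store_true', help='Use Q8 (8-bit) KV cache to save VRAM.')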
I'm inclined to merge https://github.com/oobabooga/text-generation-webui/pull/6561 instead of this PR. Reviews there are welcome.
Closing this one in favor of https://github.com/oobabooga/text-generation-webui/pull/6561. Thanks for the PR @randoentity and sorry for not having reviewed earlier!