Casper
Hi @trotsky1997, this looks very interesting! Have you conducted any experiments to measure perplexity after using Bayesian optimization?
@trotsky1997 does this code use different alpha values for X and W? You observed better perplexity with that.
Which CPU are you using? And can you post your full code, including how you load the models? Also, it looks like you did not try out TinyChat, which offers a...
Here is some feedback.

1. This part should not be a loop; just run `tokenizer.decode` on `generation_output` and use `token_num += len(generation_output)`.

```python
for output in generation_output:
    tokenizer.decode(output, skip_special_tokens=True)
    token_num...
```
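A minimal sketch of the suggested fix, assuming `generation_output` is a single 1-D sequence of token IDs coming out of `model.generate()` in your benchmark script (names are taken from that script, not defined here):

```python
# Decode the whole sequence once instead of looping over it
text = tokenizer.decode(generation_output, skip_special_tokens=True)

# Count generated tokens directly from the sequence length
token_num += len(generation_output)
```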
@abhinavkulkarni Does this integrate with the fused AWQ modules? For maximum speed, you can also use the AutoAWQ speed benchmark, which uses these fused modules by default for all LLaMa...
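For reference, a minimal sketch of how the fused modules are typically enabled when loading with AutoAWQ; the checkpoint path is a placeholder and the exact `fuse_layers` flag should be checked against the AutoAWQ examples:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "TheBloke/Llama-2-7B-AWQ"  # placeholder: any AWQ-quantized checkpoint

# fuse_layers=True enables the fused AWQ modules for faster decoding
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path)
```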
AutoAWQ is distributed on PyPI: https://github.com/casper-hansen/AutoAWQ
@wanzhenchn In my testing, AWQ provides a 2x speedup, sometimes even more than that. You should use TinyChat and not `generate()`, because `generate()` is slow.
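If you want to sanity-check the throughput numbers yourself, a rough tokens-per-second measurement around `generate()` could look like the sketch below, reusing `model` and `tokenizer` from the loading sketch above (the prompt and token budget are arbitrary):

```python
import time
import torch

# `model` and `tokenizer` are assumed to be the AWQ model and tokenizer
# loaded earlier (AutoAWQForCausalLM.from_quantized + AutoTokenizer).
inputs = tokenizer("Tell me about AWQ quantization.", return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.time() - start

# Tokens generated beyond the prompt, divided by wall-clock time
new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"generate(): {new_tokens / elapsed:.1f} tokens/s")  # compare against TinyChat's numbers
```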
I have seen this error before, but I'm not quite sure why it happens. If I remember correctly, it happened to me with the 7B and 13B LLaMa 2 models....
> Hi Authors,
>
> Any plans to release Vicuna-1.5 quantized weights? Thanks

Hi @mmaaz60, do you have access to a GPU? If so, I believe it should be easy...
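If you do have a GPU, quantizing the weights yourself is roughly the following, for example with AutoAWQ; treat the paths and config keys here as placeholders/assumptions and check the AutoAWQ README for the current API:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "lmsys/vicuna-7b-v1.5"   # placeholder: the FP16 checkpoint to quantize
quant_path = "vicuna-7b-v1.5-awq"     # where the quantized weights will be saved

# Typical 4-bit AWQ settings; exact keys may differ between versions
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration + quantization, then save the quantized model
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```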
> not working with FastChat.

I see. This may be the fault of FastChat and not AWQ. Did you try TinyChat?