mobicham
You should actually see the highest performance gain with batch_size=1: a 3-3.5x speed-up on the 4090 with 4-bit weights
torchao_int4 is the fastest for batch_size=1 with group_size=64. Gemlite is good for higher batch sizes. If you try with gpt-fast you should get the following on the RTX 4090: * torchao_int4...
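For reference, applying torchao's int4 weight-only quantization with that group size looks roughly like this. This is a minimal sketch, not the gpt-fast setup itself: the toy model and shapes are placeholders, and the exact import path may differ across torchao versions.

```Python
import torch
from torchao.quantization import quantize_, int4_weight_only

# Placeholder model; in practice this would be the LLM's Linear layers (bf16, on CUDA).
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).cuda().to(torch.bfloat16)

# int4 weight-only quantization with the group_size=64 setting discussed above.
quantize_(model, int4_weight_only(group_size=64))

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")  # batch_size=1 case
y = model(x)
```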
Thanks Kerem! We internally use vllm with ray via `LLM`, but this could be useful for people using it via the OpenAI API server indeed, unless they do it manually in...
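For context, in-process use through vllm's `LLM` class looks roughly like the sketch below (as opposed to going through the OpenAI API server). The model name and sampling settings are placeholders.

```Python
from vllm import LLM, SamplingParams

# In-process engine via the LLM class; with multiple GPUs, tensor_parallel_size
# would also be set here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```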
Sounds good to me, feel free to open a PR! It's because we support different backends, not just vllm, since we also need to run other non-LLM models. We have...
Closing this now that we have support via torchao: https://github.com/vllm-project/vllm/pull/19265
> It seems to be an [issue](https://github.com/ggerganov/llama.cpp/discussions/229) with llama.cpp. So basically they say it's a problem with quantized models running with large prompts. That sounds strange because the impact of...
+1 for this please
@plotfi here's a version with Triton that works but it's very slow:

```Python
@triton.jit
def atomic_add_cas(ptr, value, Lock, mask=None, sem: tl.constexpr = 'release'):
    while tl.atomic_cas(Lock, 0, 1, sem=sem) == 1:
        ...
```
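The snippet above is truncated. Below is a rough, self-contained sketch of how such a CAS spinlock add is typically completed and launched; the critical section, the `_accumulate_kernel` wrapper, and the host-side launch are illustrative guesses, not the exact original code.

```Python
import torch
import triton
import triton.language as tl

@triton.jit
def atomic_add_cas(ptr, value, Lock, mask=None, sem: tl.constexpr = 'release'):
    # Spin until the lock word flips from 0 to 1 (lock acquired).
    while tl.atomic_cas(Lock, 0, 1, sem=sem) == 1:
        pass
    # Critical section: emulate the atomic add with a plain load/add/store.
    tl.store(ptr, tl.load(ptr, mask=mask) + value, mask=mask)
    # Release the lock so other program instances can proceed.
    tl.atomic_xchg(Lock, 0)

@triton.jit
def _accumulate_kernel(out_ptr, lock_ptr):
    # Every program instance adds 1.0 to the same bf16 scalar.
    atomic_add_cas(out_ptr, 1.0, lock_ptr)

# Host-side usage sketch.
out = torch.zeros(1, dtype=torch.bfloat16, device="cuda")
lock = torch.zeros(1, dtype=torch.int32, device="cuda")
_accumulate_kernel[(128,)](out, lock)
print(out)  # expected: tensor([128.], dtype=torch.bfloat16)
```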
By the way, bfloat16 atomic addition also crashes on Hopper in Triton.
Thank you @rationalism! Added a few comments