mobicham
You should actually see the highest performance gain with batch_size=1: a 3-3.5x speed-up on the 4090 with 4-bit weights
torchao_int4 is the fastest for batch_size=1 with group_size=64. Gemlite is good for higher batch sizes. If you try with gpt-fast you should get the following on the RTX 4090: * torchao_int4...
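For reference, applying torchao's int4 weight-only quantization with that group size looks roughly like this. This is a minimal sketch, not the gpt-fast setup itself: the toy model and shapes are placeholders, and the exact import path may differ across torchao versions.

```Python
import torch
from torchao.quantization import quantize_, int4_weight_only

# Placeholder model; in practice this would be the LLM's Linear layers (bf16, on CUDA).
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).cuda().to(torch.bfloat16)

# int4 weight-only quantization with the group_size=64 setting discussed above.
quantize_(model, int4_weight_only(group_size=64))

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")  # batch_size=1 case
y = model(x)
```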
Thanks Kerem! We internally use vllm with ray via `LLM`, but this could be useful for people using it via the OpenAI API server indeed, unless they do it manually in...
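For context, in-process use through vllm's `LLM` class looks roughly like the sketch below (as opposed to going through the OpenAI API server). The model name and sampling settings are placeholders.

```Python
from vllm import LLM, SamplingParams

# In-process engine via the LLM class; with multiple GPUs, tensor_parallel_size
# would also be set here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```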
Sounds good to me, feel free to open a PR! It's because we support different backends, not just vllm, since we also need to run other non-LLM models. We have...
Closing this now that we have support via torchao: https://github.com/vllm-project/vllm/pull/19265
> It seems to be an [issue](https://github.com/ggerganov/llama.cpp/discussions/229) with llama.cpp. So basically they say it's a problem with quantized models running with large prompts. That sounds strange because the impact of...
+1 for this please
@plotfi here's a version with Triton that works but it's very slow:

```Python
@triton.jit
def atomic_add_cas(ptr, value, Lock, mask=None, sem: tl.constexpr = 'release'):
    while tl.atomic_cas(Lock, 0, 1, sem=sem) == 1:
        ...
```
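The snippet above is truncated. Below is a rough, self-contained sketch of how such a CAS spinlock add is typically completed and launched; the critical section, the `_accumulate_kernel` wrapper, and the host-side launch are illustrative guesses, not the exact original code.

```Python
import torch
import triton
import triton.language as tl

@triton.jit
def atomic_add_cas(ptr, value, Lock, mask=None, sem: tl.constexpr = 'release'):
    # Spin until the lock word flips from 0 to 1 (lock acquired).
    while tl.atomic_cas(Lock, 0, 1, sem=sem) == 1:
        pass
    # Critical section: emulate the atomic add with a plain load/add/store.
    tl.store(ptr, tl.load(ptr, mask=mask) + value, mask=mask)
    # Release the lock so other program instances can proceed.
    tl.atomic_xchg(Lock, 0)

@triton.jit
def _accumulate_kernel(out_ptr, lock_ptr):
    # Every program instance adds 1.0 to the same bf16 scalar.
    atomic_add_cas(out_ptr, 1.0, lock_ptr)

# Host-side usage sketch.
out = torch.zeros(1, dtype=torch.bfloat16, device="cuda")
lock = torch.zeros(1, dtype=torch.int32, device="cuda")
_accumulate_kernel[(128,)](out, lock)
print(out)  # expected: tensor([128.], dtype=torch.bfloat16)
```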
By the way, bfloat16 atomic addition also crashes on Hopper in Triton.
Thank you @rationalism! Added a few comments