Casper
Hi @radi-cho, I do find it interesting to add support for lower-bit quantization. The only caveat, especially for 2-bit, is that extreme low-bit quantized models may need more extensive methods...
This environment variable fixes this issue on multi-gpu + multi-node. `export HF_HUB_ETAG_TIMEOUT=500`
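If it helps, a minimal sketch of setting the same value from Python on each node (it must happen before any Hugging Face import, since the timeout is read from the environment at import time):

```python
import os

# Same value as the export above; set it before huggingface_hub / transformers load.
os.environ["HF_HUB_ETAG_TIMEOUT"] = "500"

from transformers import AutoModelForCausalLM  # noqa: E402
```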
Please check whether https://github.com/volcengine/verl/issues/491#issuecomment-2704116935 is the same issue causing your timeout error.
Try removing the `device_map` and `torch_dtype` arguments and downgrading transformers to 4.47.1.
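A minimal sketch of what that could look like (the model id is a placeholder, not from the original report):

```python
# pip install "transformers==4.47.1"
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"  # placeholder, use the model from your setup

tokenizer = AutoTokenizer.from_pretrained(model_id)
# No device_map= or torch_dtype= here; move the model to the device manually.
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")
```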
Hi @jackNhat, AWQ models are underoptimized in vLLM. The good news is that the `main` branch has a new optimization that enables up to 2.59x more performance - this...
`enable_thinking` defaults to True when using `apply_chat_template`. That means axolotl is basically incompatible with training Qwen3 as a non-thinking model, which may be desirable for a lot of use-cases...
@NanoCode012 I'm not sure of the internals in axolotl, but a good check is to figure out where/if `apply_chat_template` is used and then allow chat_template kwargs.
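A minimal sketch of passing the kwarg through `apply_chat_template` (the model id is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")  # illustrative model id

messages = [{"role": "user", "content": "Hello!"}]

# Extra kwargs passed to apply_chat_template are forwarded to the chat template,
# so enable_thinking could be exposed as a configurable chat_template kwarg.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # defaults to True for Qwen3 templates
)
print(text)
```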
Hi @LDLINGLINGLING. The error in your first message seems to come from the `llama.cpp` package. Have you tried the GGUF export from the AutoAWQ documentation, and did it succeed? https://casper-hansen.github.io/AutoAWQ/examples/#gguf-export
@BearBiscuit05 See #344, where I outlined the main challenge. I think it should be relatively straightforward if veRL can start using `chat` or vLLM directly adds support for tool calling in...
You should be able to replace `generate` directly with `chat`. The only problem is that we currently pass tokenized inputs into `generate`, whereas `chat` expects `List[ChatCompletionContentPartTextParam]` or `List[List[ChatCompletionContentPartTextParam]]`. Not...
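A rough sketch of the difference, assuming a recent vLLM where the offline `LLM.chat` API accepts a batch of message lists (the model id, token ids, and sampling settings are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

llm = LLM(model="your-org/your-model")  # placeholder model id
params = SamplingParams(max_tokens=64)

# Current path: pre-tokenized prompts go into generate()
token_ids = [[1, 2, 3, 4]]  # illustrative token ids
outputs = llm.generate([TokensPrompt(prompt_token_ids=ids) for ids in token_ids], params)

# chat() instead takes message lists, so the raw messages (not token ids)
# would need to be threaded through to this call.
messages = [[{"role": "user", "content": "Hello!"}]]
chat_outputs = llm.chat(messages, params)
```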