Casper


Hi @radi-cho, I do find it interesting to add support for lower-bit quantization. The only caveat, especially for 2-bit, is that extremely low-bit quantized models may need more extensive methods...

This environment variable fixes the issue on multi-GPU + multi-node setups: `export HF_HUB_ETAG_TIMEOUT=500`
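
A minimal sketch of setting it from Python instead of the shell (the checkpoint name is just a placeholder); as far as I know, `huggingface_hub` reads the variable once at import time, so it has to be set before the library is loaded:

```python
import os

# Must be set before huggingface_hub (or anything that imports it, e.g.
# transformers) is loaded, because the timeout is read at import time.
os.environ["HF_HUB_ETAG_TIMEOUT"] = "500"  # seconds

from transformers import AutoModelForCausalLM

# Placeholder checkpoint; any Hub download benefits from the longer
# ETag timeout on slow multi-node setups.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
```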

Please check whether https://github.com/volcengine/verl/issues/491#issuecomment-2704116935 is the same issue causing your timeout error.

Hi @jackNhat, AWQ models are underoptimized in vLLM. The good news is that the `main` branch has a new optimization that enables up to 2.59x higher performance - this...
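
For reference, a minimal sketch of running an AWQ checkpoint with vLLM's offline API (the repo name below is just an example AWQ model, substitute your own):

```python
from vllm import LLM, SamplingParams

# Example AWQ-quantized checkpoint; quantization="awq" selects the AWQ kernels.
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain AWQ quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```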

`enable_thinking` defaults to True when using `apply_chat_template`. That means axolotl is effectively incompatible with training Qwen3 as a non-thinking model, which may be desirable for many use cases...
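
A minimal sketch of the two behaviors, assuming a Qwen3 checkpoint such as `Qwen/Qwen3-8B`:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "Hello"}]

# Default: the chat template renders with thinking enabled.
with_thinking = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Opting out requires passing the kwarg explicitly.
without_thinking = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```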

@NanoCode012 I'm not sure about axolotl's internals, but a good check is to figure out where/if `apply_chat_template` is used and then allow chat template kwargs to be passed through.
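
Something like this hypothetical pass-through is what I have in mind (`render_prompt` and `chat_template_kwargs` are made-up names, not existing axolotl API):

```python
def render_prompt(tokenizer, messages, chat_template_kwargs=None):
    """Hypothetical helper: forward user-supplied kwargs verbatim so that
    e.g. {"enable_thinking": False} reaches the chat template without the
    framework hard-coding every template option."""
    kwargs = chat_template_kwargs or {}
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, **kwargs
    )
```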

Hi @LDLINGLINGLING. The error in your first message appears to come from the `llama.cpp` package. Have you tried the GGUF export example from the AutoAWQ documentation, and did it succeed? https://casper-hansen.github.io/AutoAWQ/examples/#gguf-export

@BearBiscuit05 See #344, where I outlined the main challenge. I think it should be relatively straightforward if veRL can start using `chat`, or if vLLM directly adds support for tool calling in...

You should be able to replace `generate` directly with `chat`. The only problem is that we currently pass tokenized inputs into `generate`, whereas `chat` expects `List[ChatCompletionContentPartTextParam]` or `List[List[ChatCompletionContentPartTextParam]]`. Not...
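
Roughly, the shape of the change would look like this sketch (placeholder model; plain `{"role": ..., "content": ...}` dicts also work as message input, with the content field optionally being a list of content parts):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # placeholder model
params = SamplingParams(max_tokens=64)

# Today: `generate` takes raw (or pre-tokenized) prompts.
out = llm.generate(["What is 2 + 2?"], params)

# With `chat`: messages instead of token ids; the content may also be
# a list of parts such as ChatCompletionContentPartTextParam dicts.
messages = [{"role": "user", "content": "What is 2 + 2?"}]
out = llm.chat(messages, params)
print(out[0].outputs[0].text)
```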