Casper
Hi @radi-cho, I do find it interesting to add support for lower-bit quantization. The only caveat, especially for 2-bit, is that extreme low-bit quantized models may need more extensive methods...
This environment variable fixes this issue on multi-gpu + multi-node. `export HF_HUB_ETAG_TIMEOUT=500`
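If it helps, a minimal sketch of setting the same value from Python on each node (it must happen before any Hugging Face import, since the timeout is read from the environment at import time):

```python
import os

# Same value as the export above; set it before huggingface_hub / transformers load.
os.environ["HF_HUB_ETAG_TIMEOUT"] = "500"

from transformers import AutoModelForCausalLM  # noqa: E402
```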
Please check whether https://github.com/volcengine/verl/issues/491#issuecomment-2704116935 is the same issue causing your timeout error.
Try removing the `device_map` and `torch_dtype` arguments and downgrading transformers to 4.47.1.
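A minimal sketch of what that could look like (the model id is a placeholder, not from the original report):

```python
# pip install "transformers==4.47.1"
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"  # placeholder, use the model from your setup

tokenizer = AutoTokenizer.from_pretrained(model_id)
# No device_map= or torch_dtype= here; move the model to the device manually.
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")
```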
Hi @jackNhat, AWQ models are underoptimized in vLLM. The good news is that the `main` branch has a new optimization that enables up to 2.59x more performance - this...
`enable_thinking` defaults to True when using `apply_chat_template`. That means axolotl is basically incompatible with training Qwen3 as a non-thinking model, which may be desirable for a lot of use-cases...
@NanoCode012 I'm not sure of the internals in axolotl, but a good check is to figure out where/if `apply_chat_template` is used and then allow chat_template kwargs.
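A minimal sketch of passing the kwarg through `apply_chat_template` (the model id is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")  # illustrative model id

messages = [{"role": "user", "content": "Hello!"}]

# Extra kwargs passed to apply_chat_template are forwarded to the chat template,
# so enable_thinking could be exposed as a configurable chat_template kwarg.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # defaults to True for Qwen3 templates
)
print(text)
```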
Hi @LDLINGLINGLING. The error in your first message seems to come from the `llama.cpp` package. Have you tried the GGUF export from the AutoAWQ documentation, and did it succeed? https://casper-hansen.github.io/AutoAWQ/examples/#gguf-export
@BearBiscuit05 See #344, where I outlined the main challenge. I think it should be relatively straightforward if veRL can start using `chat` or vLLM directly adds support for tool calling in...
You should be able to replace `generate` directly with `chat`. The only problem is that we currently pass tokenized inputs into `generate`, whereas `chat` expects `List[ChatCompletionContentPartTextParam]` or `List[List[ChatCompletionContentPartTextParam]]`. Not...
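A rough sketch of the difference, assuming a recent vLLM where the offline `LLM.chat` API accepts a batch of message lists (the model id, token ids, and sampling settings are placeholders):

```python
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

llm = LLM(model="your-org/your-model")  # placeholder model id
params = SamplingParams(max_tokens=64)

# Current path: pre-tokenized prompts go into generate()
token_ids = [[1, 2, 3, 4]]  # illustrative token ids
outputs = llm.generate([TokensPrompt(prompt_token_ids=ids) for ids in token_ids], params)

# chat() instead takes message lists, so the raw messages (not token ids)
# would need to be threaded through to this call.
messages = [[{"role": "user", "content": "Hello!"}]]
chat_outputs = llm.chat(messages, params)
```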