cmunley1
The above is actually reward hacking by calling more and more tools, so I'm changing the reward structure to exact match.
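For reference, a minimal sketch of what an exact-match reward could look like. The names `normalize` and `exact_match_reward` and the whitespace/case normalization are assumptions for illustration, not the actual reward code used here. The point is that the score depends only on the final answer, so extra tool calls earn nothing.

```python
def normalize(text: str) -> str:
    # Hypothetical normalization: trim, lowercase, collapse runs of whitespace.
    return " ".join(text.strip().lower().split())


def exact_match_reward(prediction: str, reference: str) -> float:
    # Return 1.0 only when the normalized final answer equals the reference.
    # The number of tool calls made during the rollout is never consulted.
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0
```

With this shape, a rollout that calls ten tools and one that calls none get the same reward as long as the final answers match.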
I think we could just treat it as a system message.
It seems Unsloth currently [does not support a custom rollout function](https://github.com/unslothai/unsloth/issues/3573) in its patched version of TRL's GRPOTrainer, which makes it difficult to fully use NeMo Gym as a rollout tool. We...
Hey @mmathew23, do you have a timeline for custom rollout function support? For vLLM server mode, I think that operating like TRL is sufficient, but an async vLLM engine with OpenAI...
TRL has a [custom rollout function](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L214) and a [vLLM server mode](https://github.com/huggingface/trl/blob/main/trl/scripts/vllm_serve.py) that make the integration easier. The vLLM server is not a typical AsyncLLMEngine: it does not have OpenAI chat completions/responses...
I took a stab at this [here](https://github.com/NVIDIA-NeMo/Gym/compare/main...cmunley1/reload). It seems to work but has not been tested extensively.