Benchmark program for LIMO & AIME
Checklist
- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.
Motivation
While testing for https://github.com/sgl-project/sglang/issues/3615, I noticed that the benchmark script (bench_serving.py) provided with sglang only supports basic datasets such as ShareGPT for evaluation. Benchmarking scripts for the LIMO and AIME datasets are missing. Could you please provide evaluation scripts for these two datasets? If not, I would be happy to contribute support for them to facilitate future evaluations.
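As a rough illustration of what such support could look like, here is a minimal sketch of a LIMO request sampler in the style of the existing ShareGPT sampler. The dataset id ("GAIR/LIMO"), the 'question' field name, the function name sample_limo_requests, and the (prompt, prompt_len, output_len) tuple format are all assumptions for discussion, not the actual bench_serving.py API.

```python
# Hypothetical sketch of a LIMO dataset sampler for bench_serving.py.
# Dataset id, field names, and return format are assumptions.
import random
from typing import List, Tuple

from datasets import load_dataset  # pip install datasets
from transformers import PreTrainedTokenizerBase


def sample_limo_requests(
    num_requests: int,
    tokenizer: PreTrainedTokenizerBase,
    fixed_output_len: int = 2048,
) -> List[Tuple[str, int, int]]:
    """Sample (prompt, prompt_len, output_len) tuples from the LIMO dataset."""
    # "GAIR/LIMO" and the 'question' column are assumptions about the dataset layout.
    dataset = load_dataset("GAIR/LIMO", split="train")
    questions = [row["question"] for row in dataset]
    random.shuffle(questions)

    requests = []
    for prompt in questions[:num_requests]:
        prompt_len = len(tokenizer(prompt).input_ids)
        # Math-reasoning prompts need long generations, so use a fixed
        # output length rather than the reference solution length.
        requests.append((prompt, prompt_len, fixed_output_len))
    return requests
```

An AIME sampler could follow the same pattern with a different dataset id and field names.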
Related resources
No response
cc @zhaochenyang20 @jhinpan
Hi @tanzelin430, we don't have evaluation scripts for LIMO and AIME. Feel free to raise a PR~
@tanzelin430 contact @simveit for help.
Please see this issue for guidance. You can also contact me on Slack (Simon V). @zhaochenyang20, I think we can close this issue in favor of the one I wrote.