Benchmark program for LIMO & AIME
Checklist
- [x] 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 2. Please use English, otherwise it will be closed.
Motivation
While testing for https://github.com/sgl-project/sglang/issues/3615, I noticed that the benchmark script (bench_serving.py) provided with sglang only supports basic datasets such as ShareGPT for evaluation. Benchmarking scripts for the LIMO and AIME datasets are missing. Could you please provide evaluation scripts for these two datasets? If not, I would be happy to contribute support for them to facilitate future evaluations.
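As a rough illustration of what such support could look like, here is a minimal sketch of a LIMO request sampler in the style of the existing ShareGPT sampler. The dataset id ("GAIR/LIMO"), the 'question' field name, the function name sample_limo_requests, and the (prompt, prompt_len, output_len) tuple format are all assumptions for discussion, not the actual bench_serving.py API.

```python
# Hypothetical sketch of a LIMO dataset sampler for bench_serving.py.
# Dataset id, field names, and return format are assumptions.
import random
from typing import List, Tuple

from datasets import load_dataset  # pip install datasets
from transformers import PreTrainedTokenizerBase


def sample_limo_requests(
    num_requests: int,
    tokenizer: PreTrainedTokenizerBase,
    fixed_output_len: int = 2048,
) -> List[Tuple[str, int, int]]:
    """Sample (prompt, prompt_len, output_len) tuples from the LIMO dataset."""
    # "GAIR/LIMO" and the 'question' column are assumptions about the dataset layout.
    dataset = load_dataset("GAIR/LIMO", split="train")
    questions = [row["question"] for row in dataset]
    random.shuffle(questions)

    requests = []
    for prompt in questions[:num_requests]:
        prompt_len = len(tokenizer(prompt).input_ids)
        # Math-reasoning prompts need long generations, so use a fixed
        # output length rather than the reference solution length.
        requests.append((prompt, prompt_len, fixed_output_len))
    return requests
```

An AIME sampler could follow the same pattern with a different dataset id and field names.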
Related resources
No response
cc @zhaochenyang20 @jhinpan
Hi @tanzelin430, we don't have evaluation scripts for LIMO and AIME. Feel free to raise a PR~
@tanzelin430 contact @simveit for help.
Please see this issue for guidance. You can also contact me on Slack (Simon V). @zhaochenyang20, I think we can close this issue in favor of the one I wrote.