LongBench v2 Leaderboard Submission Request: Qwen2.5-14B & Gemini 2.0 Flash Experimental Results
Dear LongBench Team,
I have conducted an extensive evaluation of the Qwen2.5-14B and Gemini 2.0 Flash Experimental models on LongBench v2.
I forked the repository and extended the evaluation pipeline to include these models. Here is my forked repo:
🔗 Forked Repository
📊 Key Results:
| Model | Params | Context | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-14B (w/ CoT) | 14B | 1M | 37.4 | 42.7 | 34.1 | 47.2 | 33.0 | 29.06 |
| Qwen2.5-14B | 14B | 1M | 42.8 | 50.8 | 37.9 | 48.9 | 41.3 | 35.5 |
| Gemini-2.0-Flash-Exp (w/ CoT) | — | 1M | 48.6 | 52.5 | 46.2 | 48.9 | 49.8 | 44.6 |
| Gemini-2.0-Flash-Exp | — | 1M | 45.7 | 49.4 | 43.4 | 49.4 | 42.3 | 46.6 |
Additionally, I visualized my results with radar charts, showcasing performance across multiple evaluation dimensions.
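The radar charts were produced with a small matplotlib script along the lines of the sketch below (illustrative only; the actual charts use different styling). The values shown are the Qwen2.5-14B row from the table above.

```python
# Minimal radar-chart sketch (assumption: matplotlib); values are the
# Qwen2.5-14B (no CoT) row from the results table.
import matplotlib.pyplot as plt
import numpy as np

labels = ["Easy", "Hard", "Short", "Medium", "Long"]
scores = [50.8, 37.9, 48.9, 41.3, 35.5]

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles_closed = angles + angles[:1]   # repeat the first point to close the polygon
scores_closed = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles_closed, scores_closed, linewidth=2)
ax.fill(angles_closed, scores_closed, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(labels)
ax.set_title("Qwen2.5-14B on LongBench v2")
plt.show()
```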
🛠️ Modifications & Enhancements:
- Added support for the Gemini API in `pred.py`, allowing seamless integration (a minimal illustrative sketch follows this list).
- Introduced random sleep intervals between requests to avoid exceeding API rate limits.
- Optimized inference parameters for better efficiency.
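The Gemini integration and rate-limit handling look roughly like the sketch below. This is only an illustrative outline, assuming the `google-generativeai` client and the `gemini-2.0-flash-exp` model id; the actual changes are in the forked repository.

```python
# Illustrative sketch of the pred.py changes (assumptions: the
# google-generativeai client and the "gemini-2.0-flash-exp" model id;
# see the forked repo for the real diff).
import os
import random
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash-exp")

def query_gemini(prompt: str, max_retries: int = 3) -> str:
    """Call the Gemini API, sleeping a random interval to stay under rate limits."""
    for attempt in range(max_retries):
        # Random back-off between requests so bursts do not hit the quota.
        time.sleep(random.uniform(1.0, 5.0))
        try:
            response = model.generate_content(prompt)
            return response.text
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
    return ""
```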
❓ Request for Submission
I would like to submit these results to the official LongBench v2 leaderboard.
Could you please guide me on the submission process? Should I open a Pull Request (PR) with the updated results, or is there another preferred method?
I appreciate your time and contributions to the community!
Best regards,
NewMind AI Team
[email protected] - [email protected]
Hi, great work on getting these models added! I have been working on something similar but have run into reproducibility issues, so I opened issue #111 to discuss it.
I was wondering whether you could also evaluate the OpenAI models on your local install, to see if you hit the same issue I described in #111?
Hey, thanks for your submission! Our author team has already evaluated Gemini-2.0-Flash-Exp, and the results are released at https://longbench2.github.io/. We will validate your evaluation results on Qwen2.5-14B and update the leaderboard.
Could you please share how you evaluated the Gemini-2.0-Flash-Exp model, specifically how you truncated the model input and which decoding parameters you used? We also evaluated Gemini-2.0-Flash-Exp, truncating the input to 800,000 tokens (leaving the remaining budget for the model response, system prompt, etc.), but were unable to reproduce the leaderboard result, so we are wondering whether we missed something during inference.
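To make the question concrete, the sketch below shows one way we could implement that truncation; the HuggingFace tokenizer used as a token counter and the keep-head-and-tail (middle truncation) strategy are illustrative assumptions, not necessarily what the official pipeline does.

```python
# Illustrative truncation to ~800k tokens (assumptions: a HuggingFace
# tokenizer as the token counter and middle truncation that keeps the
# head and tail of the context).
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 800_000  # remaining budget reserved for system prompt + response

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

def truncate_middle(prompt: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Drop the middle of the token sequence if the prompt is too long."""
    tokens = tokenizer.encode(prompt, add_special_tokens=False)
    if len(tokens) <= max_tokens:
        return prompt
    half = max_tokens // 2
    return tokenizer.decode(tokens[:half]) + tokenizer.decode(tokens[-half:])
```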
Thanks a lot for your time!