
LongBench v2 Leaderboard Submission Request: Qwen2.5-14B & Gemini2.0 Flash Experimental Results

Open iclal07 opened this issue 10 months ago • 3 comments

Dear LongBench Team,

I have conducted an extensive evaluation using Qwen2.5-14B and Gemini2.0 Flash Experimental models on LongBench v2.
I forked the repository and extended the evaluation pipeline to include these models. Here is my forked repo:
🔗 Forked Repository

📊 Key Results:

| Model | Params | Context | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-14B (w/ CoT) | 14B | 1M | 37.4 | 42.7 | 34.1 | 47.2 | 33.0 | 29.06 |
| Qwen2.5-14B | 14B | 1M | 42.8 | 50.8 | 37.9 | 48.9 | 41.3 | 35.5 |
| Gemini-2.0-Flash-Exp (w/ CoT) | – | 1M | 48.6 | 52.5 | 46.2 | 48.9 | 49.8 | 44.6 |
| Gemini-2.0-Flash-Exp | – | 1M | 45.7 | 49.4 | 43.4 | 49.4 | 42.3 | 46.6 |

Additionally, I visualized my results with radar charts, showcasing performance across multiple evaluation dimensions.

🛠️ Modifications & Enhancements:

  • Added support for the Gemini API in pred.py, allowing seamless integration.
  • Introduced random sleep intervals between requests to avoid exceeding API rate limits.
  • Tuned inference parameters for better efficiency.
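The rate-limit handling described above can be sketched roughly as follows. This is a minimal, hedged illustration, not the fork's actual code: `call_with_backoff`, the `jitter` parameter, and the retry policy are all my own naming and assumptions, and the real pred.py changes may differ.

```python
import random
import time


def call_with_backoff(request_fn, max_retries=5, jitter=(1.0, 3.0)):
    """Call an API function with a random pause before each attempt to
    stay under rate limits, retrying with exponential backoff on failure.

    request_fn: any zero-argument callable that issues the API request.
    jitter: (low, high) bounds in seconds for the random pre-request sleep.
    """
    for attempt in range(max_retries):
        # Jittered pause spreads concurrent requests out over time.
        time.sleep(random.uniform(*jitter))
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff scaled by the jitter floor: wait longer
            # after each consecutive failure before retrying.
            time.sleep((2 ** attempt) * jitter[0])
```

A caller would wrap each model query, e.g. `call_with_backoff(lambda: client.generate(prompt))`, so transient rate-limit errors are retried instead of aborting the run.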

❓ Request for Submission

I would like to submit these results to the official LongBench v2 leaderboard.
Could you please guide me on the submission process? Should I open a Pull Request (PR) with the updated results, or is there another preferred method?

I appreciate your time and contributions to the community!

Best regards,
NewMind AI Team [email protected] - [email protected]

iclal07 avatar Feb 03 '25 11:02 iclal07

Hi, great work on getting these models added! I have been working on something similar but ran into reproducibility issues, which I opened issue #111 to discuss.

Would you be able to evaluate the OpenAI models on your local install as well, to see whether you hit the same issue I described in #111?

Hisham-Cohere avatar Feb 07 '25 00:02 Hisham-Cohere

Hey, thanks for your submission! Our author team has already evaluated Gemini-2.0-Flash-Exp, and the results are released at https://longbench2.github.io/. We will validate your evaluation results on Qwen2.5-14B and update the leaderboard.

bys0318 avatar Feb 13 '25 10:02 bys0318

Could you please share how you evaluated the Gemini-2.0-Flash-Exp model? Specifically, how did you truncate the model input, and which decoding parameters did you use? We also evaluated Gemini-2.0-Flash-Exp, truncating the input to 800,000 tokens (leaving the remaining budget for the model response, system prompt, etc.), but we were unable to replicate the leaderboard result. We wonder whether we are missing something during inference.
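For reference, the truncation strategy in LongBench's released pipeline drops tokens from the middle of the context, keeping the head and tail, rather than cutting off the end. A minimal sketch of that idea, operating on an already-tokenized sequence (the actual tokenizer and the exact budget split used for the leaderboard runs are assumptions here):

```python
def truncate_middle(tokens, max_tokens):
    """Keep the first and last halves of a token sequence, dropping the
    middle, so both the beginning and end of the context survive when the
    input exceeds the budget (e.g. 800,000 tokens)."""
    if len(tokens) <= max_tokens:
        return list(tokens)
    half = max_tokens // 2
    # Head half plus tail half; the tail absorbs the odd token if
    # max_tokens is odd.
    return list(tokens[:half]) + list(tokens[len(tokens) - (max_tokens - half):])
```

If the leaderboard run instead truncated only from one end, or reserved a different share of the window for the response and system prompt, that alone could explain a gap of this size, which is why the exact settings matter.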

Thanks a lot for your time!

xuandif-cmu avatar Mar 05 '25 22:03 xuandif-cmu