LongBench v2 Leaderboard Submission Request: Qwen2.5-14B & Gemini 2.0 Flash Experimental Results
Dear LongBench Team,
I have conducted an extensive evaluation of the Qwen2.5-14B and Gemini 2.0 Flash Experimental models on LongBench v2.
I forked the repository and extended the evaluation pipeline to include these models. Here is my forked repo:
🔗 Forked Repository
📊 Key Results:
| Model | Params | Context | Overall (%) | Easy (%) | Hard (%) | Short (%) | Medium (%) | Long (%) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-14B (w/ CoT) | 14B | 1M | 37.4 | 42.7 | 34.1 | 47.2 | 33.0 | 29.06 |
| Qwen2.5-14B | 14B | 1M | 42.8 | 50.8 | 37.9 | 48.9 | 41.3 | 35.5 |
| Gemini-2.0-Flash-Exp (w/ CoT) | — | 1M | 48.6 | 52.5 | 46.2 | 48.9 | 49.8 | 44.6 |
| Gemini-2.0-Flash-Exp | — | 1M | 45.7 | 49.4 | 43.4 | 49.4 | 42.3 | 46.6 |
Additionally, I visualized my results with radar charts, showcasing performance across multiple evaluation dimensions.
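The radar charts were produced with a small matplotlib script along the lines of the sketch below (illustrative only; the actual charts use different styling). The values shown are the Qwen2.5-14B row from the table above.

```python
# Minimal radar-chart sketch (assumption: matplotlib); values are the
# Qwen2.5-14B (no CoT) row from the results table.
import matplotlib.pyplot as plt
import numpy as np

labels = ["Easy", "Hard", "Short", "Medium", "Long"]
scores = [50.8, 37.9, 48.9, 41.3, 35.5]

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles_closed = angles + angles[:1]   # repeat the first point to close the polygon
scores_closed = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles_closed, scores_closed, linewidth=2)
ax.fill(angles_closed, scores_closed, alpha=0.25)
ax.set_xticks(angles)
ax.set_xticklabels(labels)
ax.set_title("Qwen2.5-14B on LongBench v2")
plt.show()
```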
🛠️ Modifications & Enhancements:
- Added support for the Gemini API in `pred.py`, allowing seamless integration (a minimal illustrative sketch follows this list).
- Introduced random sleep intervals between requests to avoid exceeding API rate limits.
- Optimized inference parameters for better efficiency.
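The Gemini integration and rate-limit handling look roughly like the sketch below. This is only an illustrative outline, assuming the `google-generativeai` client and the `gemini-2.0-flash-exp` model id; the actual changes are in the forked repository.

```python
# Illustrative sketch of the pred.py changes (assumptions: the
# google-generativeai client and the "gemini-2.0-flash-exp" model id;
# see the forked repo for the real diff).
import os
import random
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash-exp")

def query_gemini(prompt: str, max_retries: int = 3) -> str:
    """Call the Gemini API, sleeping a random interval to stay under rate limits."""
    for attempt in range(max_retries):
        # Random back-off between requests so bursts do not hit the quota.
        time.sleep(random.uniform(1.0, 5.0))
        try:
            response = model.generate_content(prompt)
            return response.text
        except Exception as exc:
            print(f"Attempt {attempt + 1} failed: {exc}")
    return ""
```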
❓ Request for Submission
I would like to submit these results to the official LongBench v2 leaderboard.
Could you please guide me on the submission process? Should I open a Pull Request (PR) with the updated results, or is there another preferred method?
I appreciate your time and contributions to the community!
Best regards,
NewMind AI Team
[email protected] - [email protected]
Hi, great work on getting these models added! I have been working on something similar but have run into reproducibility issues, so I opened issue #111 to discuss it.
I was wondering whether you could also evaluate the OpenAI models on your local install, to see if you hit the same issue I described in #111?
Hey, thanks for your submission! Our author team has already evaluated Gemini-2.0-Flash-Exp, and the results are released at https://longbench2.github.io/. We will validate your evaluation results on Qwen2.5-14B and update the leaderboard.
Could you please share how you evaluated the Gemini-2.0-Flash-Exp model, specifically how you truncated the model input and which decoding parameters you used? We also evaluated Gemini-2.0-Flash-Exp, truncating the input to 800,000 tokens (leaving the remaining budget for the model response, system prompt, etc.), but were unable to reproduce the leaderboard result, so we are wondering whether we missed something during inference.
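To make the question concrete, the sketch below shows one way we could implement that truncation; the HuggingFace tokenizer used as a token counter and the keep-head-and-tail (middle truncation) strategy are illustrative assumptions, not necessarily what the official pipeline does.

```python
# Illustrative truncation to ~800k tokens (assumptions: a HuggingFace
# tokenizer as the token counter and middle truncation that keeps the
# head and tail of the context).
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 800_000  # remaining budget reserved for system prompt + response

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

def truncate_middle(prompt: str, max_tokens: int = MAX_INPUT_TOKENS) -> str:
    """Drop the middle of the token sequence if the prompt is too long."""
    tokens = tokenizer.encode(prompt, add_special_tokens=False)
    if len(tokens) <= max_tokens:
        return prompt
    half = max_tokens // 2
    return tokenizer.decode(tokens[:half]) + tokenizer.decode(tokens[-half:])
```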
Thanks a lot for your time!