eval-dev-quality
Add Qwen3, GLM4.5, and DeepSeek series as an update
Those are not in the current leaderboard, but things have already changed since last quarter.
- [ ] Qwen3 Coder series
- [ ] Qwen3 dense and MoE models
- [ ] Qwen3 distilled models
- [ ] GLM 4.5, both the original and Air models
- [ ] DeepSeek v3.1 base
- [ ] DeepSeek-R1-0528
Other suggestions
- [ ] Kimi K2 and Kimi Coder are probably worth adding even if they fail on some tasks
- [ ] GPT-OSS model series is good to compare as US representation
- [ ] Maybe LG's Exaone, Meta's Llama 4, and MiniMax, if there is enough time for this