[🎯 Roadmap] EvalScope Roadmap
English Version
Planned Benchmarks Support
1. Agent
- [x] 𝜏²-Bench #959
- [ ] Terminal-Bench
2. Code
- [ ] Multi-E
- [x] SciCode
- [x] SWE-Bench #976
3. Instruction Following
- [x] IFBench
4. Vision Language
- [ ] RefCOCO
- [ ] CC-OCR
5. Audio
- [ ] FLEURS
Features
- [ ] Performance Testing Enhancement: Support dynamic concurrency adjustment and automatic testing of model service metrics including minimum latency, TTFT (Time To First Token), and maximum throughput
- [x] Extended Evaluation Metrics: Add support for more evaluation metrics, including cons@k, G-pass@k, etc.
- [x] Function Call & Tool Use: Add support for evaluating custom function-call and tool-use scenarios
- [ ] Prompt Management Optimization: Improve prompt management to facilitate setting different prompts for benchmarks
- [ ] Safety Benchmarks: Support safety-related benchmarks (suggestions for datasets are welcome)
- [ ] UI Development: Develop an interactive UI interface for visual model evaluation (long-term goal)
- [ ] Benchmark Collections: More comprehensive support for benchmark collections, used to evaluate curated benchmark suites
- [ ] Embedding Stress Testing: Support stress testing of embedding model services
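As a rough sketch of what the planned performance-testing enhancement would measure, the snippet below times TTFT (Time To First Token) and token throughput for a streaming response. The `stream_tokens` generator is a hypothetical stand-in for a real model service endpoint, used here purely for illustration; it is not part of the EvalScope API.

```python
import time

def stream_tokens(n_tokens=5, delay=0.01):
    # Hypothetical stand-in for a streaming model endpoint:
    # yields one token every `delay` seconds.
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i}"

def measure_stream(stream):
    """Return (ttft_seconds, tokens_per_second) for a token stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # Time from request start until the first token arrives.
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return ttft, count / total

ttft, tps = measure_stream(stream_tokens())
print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {tps:.1f} tok/s")
```

A stress-testing harness would run many such measurements under increasing concurrency and report the minimum latency, TTFT distribution, and peak sustained throughput.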
Bug Fixes
- Embedding Model Evaluation: Fix the benchmark misalignment issue in embedding model evaluation (Issue: https://github.com/modelscope/evalscope/issues/753)
- RAG Evaluation: Fix the issue where evaluation sets cannot be automatically constructed in rageval (Issue: https://github.com/modelscope/evalscope/issues/859)