
[🎯 Roadmap] EvalScope Roadmap

Yunnglin opened this issue on Nov 4, 2025 · 0 comments


Planned Benchmarks Support

1. Agent

  • [x] 𝜏²-Bench #959
  • [ ] Terminal-Bench

2. Code

  • [ ] Multi-E
  • [x] SciCode
  • [x] SWE-Bench #976

3. Instruction Following

  • [x] IFBench

4. Vision Language

  • [ ] RefCOCO
  • [ ] CC-OCR

5. Audio

  • [ ] FLEURS

Features

  • [ ] Performance Testing Enhancements: Support dynamic concurrency adjustment and automatic measurement of model-service metrics, including minimum latency, TTFT (Time To First Token), and maximum throughput
  • [x] Extended Evaluation Metrics: Add support for more evaluation metrics, including cons@k, G-pass@k, etc.
  • [x] Function Call & Tool Use: Add support for evaluating custom function-call and tool-use scenarios
  • [ ] Prompt Management Optimization: Improve prompt management to make it easy to set different prompts per benchmark
  • [ ] Safety Benchmarks: Support safety-related benchmarks (dataset suggestions are welcome)
  • [ ] UI Development: Develop an interactive UI for visual model evaluation (long-term goal)
  • [ ] Benchmark Collections: Provide more comprehensive support for benchmark collections, so that indexed benchmark suites can be evaluated as a unit
  • [ ] Embedding Stress Testing: Support stress testing of embedding model services
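
As a concrete reference for the extended-metrics item above: cons@k is commonly defined as a majority vote over k sampled answers, and pass@k (the building block that G-pass@k generalizes) has a standard unbiased estimator. The sketch below illustrates those common definitions only; it is not EvalScope's implementation, and the function names are ours.

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n total samples of which c
    are correct, is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def cons_at_k(answers: list[str], reference: str) -> int:
    """cons@k (consensus / majority vote): 1 if the most frequent answer
    among the k samples matches the reference, else 0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return int(majority == reference)
```

For example, with n=2 samples of which c=1 is correct, `pass_at_k(2, 1, 1)` gives 0.5, and `cons_at_k(["a", "a", "b"], "a")` gives 1 because the majority answer matches the reference.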

Bug Fixes

  1. Embedding Model Evaluation: Fix the benchmark misalignment issue in embedding model evaluation
    Issue: https://github.com/modelscope/evalscope/issues/753
  2. RAG Evaluation: Fix the issue where evaluation sets cannot be automatically constructed in rageval
    Issue: https://github.com/modelscope/evalscope/issues/859

