eval-dev-quality
Roadmap for v0.6.0
Tasks/Goals:
- [x] Development & Management 🛠️
- [x] Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger https://github.com/symflower/eval-dev-quality/pull/189
- [x] Documentation 📚
- [x] Document roadmaps and release schedule by @bauersimon https://github.com/symflower/eval-dev-quality/pull/196
- [x] Evaluation ⏱️
- [x] Isolated Execution https://github.com/symflower/eval-dev-quality/issues/198, https://github.com/symflower/eval-dev-quality/issues/17
- [x] Docker Support
- [x] Build Docker image for every release by @Munsio https://github.com/symflower/eval-dev-quality/pull/199
- [x] Docker evaluation runtime by @Munsio https://github.com/symflower/eval-dev-quality/pull/211, https://github.com/symflower/eval-dev-quality/pull/238, https://github.com/symflower/eval-dev-quality/pull/234, https://github.com/symflower/eval-dev-quality/pull/252
- [x] Parallel execution of containerized evaluations by @Munsio https://github.com/symflower/eval-dev-quality/pull/221
- [x] Run docker image generation on each push by @Munsio https://github.com/symflower/eval-dev-quality/pull/247
- [x] fix, Use `main` revision docker tag by default by @Munsio https://github.com/symflower/eval-dev-quality/pull/249, https://github.com/symflower/eval-dev-quality/issues/242
- [x] fix, Add commit revision to docker and reports by @Munsio https://github.com/symflower/eval-dev-quality/issues/207, https://github.com/symflower/eval-dev-quality/pull/255
- [x] fix, IO error when multiple Containers use the same result path by @Munsio https://github.com/symflower/eval-dev-quality/issues/219, https://github.com/symflower/eval-dev-quality/issues/273, https://github.com/symflower/eval-dev-quality/pull/274
- [x] Test docker in GitHub Actions by @Munsio https://github.com/symflower/eval-dev-quality/issues/224, https://github.com/symflower/eval-dev-quality/pull/260
- [x] fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio https://github.com/symflower/eval-dev-quality/pull/290
- [x] fix, Pass environment tokens into container by @Munsio https://github.com/symflower/eval-dev-quality/pull/250
- [x] fix, Use a pinned Java 11 version by @Munsio https://github.com/symflower/eval-dev-quality/pull/279
- [x] Make paths absolute when copying docker results cause docker gets confused with paths containing colons by @Munsio https://github.com/symflower/eval-dev-quality/issues/302, https://github.com/symflower/eval-dev-quality/pull/308
- [x] Kubernetes Support
- [x] Kubernetes evaluation runtime by @Munsio https://github.com/symflower/eval-dev-quality/pull/231
- [x] Copy back results from the cluster to the initial host by @Munsio https://github.com/symflower/eval-dev-quality/pull/272
- [x] fix, Only use valid characters in Kubernetes job names by @Munsio https://github.com/symflower/eval-dev-quality/pull/292
- [x] Timeouts for test execution and `symflower` test generation by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/167, https://github.com/symflower/eval-dev-quality/issues/185, https://github.com/symflower/eval-dev-quality/issues/232, https://github.com/symflower/eval-dev-quality/issues/, https://github.com/symflower/eval-dev-quality/pull/277, https://github.com/symflower/eval-dev-quality/pull/267, https://github.com/symflower/eval-dev-quality/pull/188
- [x] Clarify prompt that code responses must be in code fences by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/43, https://github.com/symflower/eval-dev-quality/issues/257, https://github.com/symflower/eval-dev-quality/pull/259
- [x] fix, Use backoff for retrying LLMs cause some LLMs need more time to recover by @zimmski https://github.com/symflower/eval-dev-quality/pull/172
- [x] Models 🤖
- [x] Pull Ollama models if they are selected for evaluation by @Munsio https://github.com/symflower/eval-dev-quality/issues/283, https://github.com/symflower/eval-dev-quality/pull/284
- [x] Model Selection
- [x] Exclude certain models (e.g. "openrouter/auto"), because it just forwards to a model automatically by @bauersimon https://github.com/symflower/eval-dev-quality/issues/126, https://github.com/symflower/eval-dev-quality/pull/288
- [x] Exclude the `perplexity` online models because they have a "per request" cost https://github.com/symflower/eval-dev-quality/pull/288 (automatically excluded as online models)
- [x] Additional Models
- [x] Snowflake
- [x] DeepSeek V2
- [x] CodeQwen 7B
- [x] Gemma 2
- [x] Cohere Aya
- [x] Yi 1.5
- [x] Phi 3
- [x] Falcon
- [x] Mistral 7B 0.3
- [x] Codegemma
- [x] fix, Retry openrouter models query cause it sometimes just errors by @bauersimon https://github.com/symflower/eval-dev-quality/issues/186, https://github.com/symflower/eval-dev-quality/pull/191
- [x] fix, Default to all repositories if none are explicitly selected by @bauersimon https://github.com/symflower/eval-dev-quality/issues/163, https://github.com/symflower/eval-dev-quality/pull/182
- [x] fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/225, https://github.com/symflower/eval-dev-quality/pull/269
- [x] fix, Always use forward slashes in prompts so that paths are unified by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/152, https://github.com/symflower/eval-dev-quality/pull/268
- [x] Reports & Metrics 🗒️
- [x] Logging
- [x] refactor, Structural logging by @ahumenberger https://github.com/symflower/eval-dev-quality/pull/245
- [x] Store model responses in separate files for easier lookup by @ahumenberger https://github.com/symflower/eval-dev-quality/issues/181, https://github.com/symflower/eval-dev-quality/pull/278
- [x] Store coverage objects by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/223
- [x] Write out results right away so we don't lose anything if the evaluation crashes by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/237, https://github.com/symflower/eval-dev-quality/pull/243
- [x] refactor, Abstract the storage of assessments by @ahumenberger https://github.com/symflower/eval-dev-quality/issues/169, https://github.com/symflower/eval-dev-quality/pull/178
- [x] fix, Do not overwrite results but create a separate result directory by @bauersimon https://github.com/symflower/eval-dev-quality/issues/176, https://github.com/symflower/eval-dev-quality/pull/179
- [x] New `report` subcommand for postprocessing report data
- [x] `report` subcommand to compare multiple evaluations into one by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/205, https://github.com/symflower/eval-dev-quality/pull/271
- [x] Let `report` command also combine markdown reports by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/258
- [x] Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility https://github.com/symflower/eval-dev-quality/issues/282
- [x] Store models for the evaluation in JSON configuration report by @bauersimon https://github.com/symflower/eval-dev-quality/pull/285
- [x] Store repositories for the evaluation in JSON configuration report by @bauersimon https://github.com/symflower/eval-dev-quality/pull/287
- [x] Load models and repositories that were used from JSON configuration by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/291
- [x] Report maximum of executable files by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/215, https://github.com/symflower/eval-dev-quality/pull/261
- [x] Experiment with human-readable model names and costs to prepare for data visualization
- [x] Generate the summed model files from the evaluation.csv by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/241
- [x] Extract human-readable names of models by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/206, https://github.com/symflower/eval-dev-quality/pull/217
- [x] Extract model costs by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/210, https://github.com/symflower/eval-dev-quality/pull/216
- [x] Remove summed CSVs, human-readable names to handle them later during visualization by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/256
- [x] Use new Symflower version which reduces error output of the "fix" command by @bauersimon in #323
- [x] Operating Systems 🖥️
- [x] More tests for Windows
- [x] Explicitly test Java test path logic on Windows by @bauersimon https://github.com/symflower/eval-dev-quality/issues/159, https://github.com/symflower/eval-dev-quality/pull/184
- [x] Extend temporary repository tests to Windows by @bauersimon https://github.com/symflower/eval-dev-quality/issues/141
- [x] Tools 🧰
- [x] `symflower fix` auto-repair of common LLM mistakes
- [x] Integrate `symflower fix` into evaluation by @ruiAzevedo19, @bauersimon https://github.com/symflower/eval-dev-quality/issues/213, https://github.com/symflower/eval-dev-quality/pull/229
- [x] Do not run `symflower fix` when there is a timeout of the LLM by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/232, https://github.com/symflower/eval-dev-quality/pull/236
- [x] Update `symflower` to latest version to benefit from improved Go test package repairs by @bauersimon, @Munsio https://github.com/symflower/eval-dev-quality/pull/294, https://github.com/symflower/eval-dev-quality/pull/303
- [x] Tasks 🔢
- [x] Infrastructure for different Task types
- [x] Introduce the interface for doing "evaluation tasks" so we can easily add them by @ahumenberger https://github.com/symflower/eval-dev-quality/pull/197, https://github.com/symflower/eval-dev-quality/issues/165, https://github.com/symflower/eval-dev-quality/pull/166
- [x] fix, CSV header missing the task identifier by @bauersimon https://github.com/symflower/eval-dev-quality/issues/187, https://github.com/symflower/eval-dev-quality/pull/190
- [x] Compile Go and Java so compilation errors can be used for code repair task by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/160, https://github.com/symflower/eval-dev-quality/pull/162
- [x] refactor, Share logging setup between multiple tasks by @bauersimon https://github.com/symflower/eval-dev-quality/issues/200, https://github.com/symflower/eval-dev-quality/pull/202
- [x] fix, Missing return statements when checking model capabilities by @bauersimon https://github.com/symflower/eval-dev-quality/pull/239
- [x] Validate task repositories before evaluation by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/263, https://github.com/symflower/eval-dev-quality/pull/265, https://github.com/symflower/eval-dev-quality/pull/306
- [x] New task types
- [x] Evaluation task for code repair by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/168, https://github.com/symflower/eval-dev-quality/pull/170, https://github.com/symflower/eval-dev-quality/pull/192
- [x] fix, Ignore git and Maven repositories when validating code-repair repositories by @ahumenberger, @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/281
- [x] fix, Correct test value for "variable unknown" code repair task by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/212
- [x] fix, Score with passing tests in code-repair task cause coverage can be cheated by @bauersimon https://github.com/symflower/eval-dev-quality/issues/320, https://github.com/symflower/eval-dev-quality/pull/321
- [x] Evaluation task for transpilation (Go->Java and Java->Go) by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/201, https://github.com/symflower/eval-dev-quality/pull/246, https://github.com/symflower/eval-dev-quality/pull/226
- [x] Early merger for transpilation task by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/264
- [x] fix, Make Java Knapsack easier to solve by reducing Java specifics by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/230, https://github.com/symflower/eval-dev-quality/pull/262
- [x] Internal management of Testdata repositories as temporary Git repositories
- [x] fix, Create temporary repositories just once by @bauersimon https://github.com/symflower/eval-dev-quality/issues/157, https://github.com/symflower/eval-dev-quality/pull/180
- [x] fix, Fail tests immediately if outdated tools are installed by @bauersimon https://github.com/symflower/eval-dev-quality/issues/156, https://github.com/symflower/eval-dev-quality/pull/171
- [x] fix, Clarify Java build files to use proper version as required by Maven by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/270, https://github.com/symflower/eval-dev-quality/pull/275
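The "Only use valid characters in Kubernetes job names" fix listed under Kubernetes Support can be illustrated with a minimal Go sketch. It is not the code from the linked PR: `sanitizeJobName` is a hypothetical helper, and it assumes the stricter RFC 1123 label rules (lowercase alphanumerics and `-`, alphanumeric at both ends, at most 63 characters) are what the job name has to satisfy.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// invalidJobNameCharacters matches every run of characters that is not
// allowed in an RFC 1123 DNS label.
var invalidJobNameCharacters = regexp.MustCompile(`[^a-z0-9-]+`)

// sanitizeJobName is a hypothetical helper (not taken from the project) that
// turns an arbitrary model identifier into a string usable as a job name.
func sanitizeJobName(name string) string {
	name = strings.ToLower(name)
	// Replace each run of invalid characters (e.g. "/", ":", "_") by one dash.
	name = invalidJobNameCharacters.ReplaceAllString(name, "-")
	// The string is pure ASCII at this point, so byte slicing is safe.
	if len(name) > 63 {
		name = name[:63]
	}
	// Labels must start and end with an alphanumeric character.
	return strings.Trim(name, "-")
}

func main() {
	fmt.Println(sanitizeJobName("openrouter/meta-llama/llama-3-70b-instruct"))
}
```

Note that the result can be empty if the input contains no valid characters at all, so a caller would still need to handle that case.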
Release version of this roadmap issue:
❓ When should a release happen? Check the README!
- [ ] Do a full evaluation with the version
- [x] Exclude certain Openrouter models by default
- [x] `nitro` cause they are just faster
- [x] `extended` cause longer context windows don't matter for our tasks
- [x] `free` and `auto` cause these are just "aliases" for existing models
- [x] Exclude special-purpose models
- [x] Vision models
- [x] Roleplay and creative writing models
- [x] Classification models
- [x] Models with internet access (usually denoted by `-online` suffix)
- [x] Models with extended context windows (usually denoted by `-1234K` suffix)
- [x] Always prefer fine-tuned (`-instruct`, `-chat`) models over a plain base model
- [x] Tag version (tag can be moved in case important merges happen afterwards)
- [x] For all issues of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple issues) with the users that worked on it
- Fixed bugs should always be sorted into respective relevant categories and not in a generic "Bugs" category!
- [x] For all PRs of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple issues) with the users that worked on it
- Fixed bugs should always be sorted into respective relevant categories and not in a generic "Bugs" category!
- [x] Search all issues for ...
- [x] Unassigned issues that are closed, and assign them to someone
- [x] Issues without a milestone, and assign them a milestone
- [x] Issues without a label, and assign them at least one label
- [x] Write the release notes:
- [x] Use the tasks that are already there for the release note outline
- [x] Add highlighted features based on the done tasks, sort by how many users would use the feature
- [x] Do the release
- [x] With the release notes
- [x] Set as latest release
- [x] Prepare the next roadmap
- [x] Create a milestone for the next release
- [x] Create a new roadmap issue for the next release
- [x] Move all open tasks/TODOs from this roadmap issue to the next roadmap issue.
- [x] Move every comment of this roadmap issue as a TODO to the next roadmap issue. Mark when done with a :rocket: emoji.
- [ ] Blog post containing evaluation results, new features and learnings
- [ ] Update README with blog post link and new header image
- [ ] Update repository link with blog post link
- [ ] https://github.com/symflower/eval-dev-quality/discussions
- [ ] Remove the previous announcements
- [ ] Add a "Deep dive: $blog-post-title" announcement for the blog post
- [ ] Add a "v$version: $summary-of-highlights" announcement for the release
- [ ] Announce release
- [ ] Eat cake 🎂
Leftover TODOs were moved to https://github.com/symflower/eval-dev-quality/issues/301.
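
The "Exclude certain Openrouter models by default" rules from the release checklist above could be sketched in Go roughly as follows. This is an illustration under assumptions, not the project's implementation: `isExcludedByDefault` is a hypothetical helper, the `:` separator in front of `nitro`, `extended` and `free` is an assumption about how the identifiers look, and all model identifiers in `main` except "openrouter/auto" are made up.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extendedContextSuffix matches identifiers that end in a context-size suffix
// such as "-128k" (the "-1234K" pattern from the checklist).
var extendedContextSuffix = regexp.MustCompile(`-\d+k$`)

// isExcludedByDefault is a hypothetical helper that reports whether a model
// identifier falls under one of the default exclusion rules.
func isExcludedByDefault(modelID string) bool {
	id := strings.ToLower(modelID)

	// "auto" only forwards to another model.
	if id == "openrouter/auto" {
		return true
	}
	// Variants that are faster, longer-context or free aliases of existing
	// models. The ":" separator is an assumption.
	for _, variant := range []string{":nitro", ":extended", ":free"} {
		if strings.HasSuffix(id, variant) {
			return true
		}
	}
	// Models with internet access or with extended context windows.
	return strings.HasSuffix(id, "-online") || extendedContextSuffix.MatchString(id)
}

func main() {
	// Made-up identifiers for illustration only.
	for _, id := range []string{
		"openrouter/auto",
		"example/model-7b-instruct",
		"example/model-7b-instruct:nitro",
		"example/model-large-online",
		"example/model-128k",
	} {
		fmt.Printf("%s excluded=%t\n", id, isExcludedByDefault(id))
	}
}
```

Rules that depend on a model's purpose (vision, roleplay, classification) cannot be derived from a suffix and would need metadata or a manually curated list.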