Roadmap for v0.6.0

Open zimmski opened this issue 1 year ago • 0 comments

Tasks/Goals:

[x] Development & Management 🛠️
- [x] Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger https://github.com/symflower/eval-dev-quality/pull/189
[x] Documentation 📚
- [x] Document roadmaps and release schedule by @bauersimon https://github.com/symflower/eval-dev-quality/pull/196
[x] Evaluation ⏱️
- [x] Isolated Execution https://github.com/symflower/eval-dev-quality/issues/198, https://github.com/symflower/eval-dev-quality/issues/17
  - [x] Docker Support
    - [x] Build Docker image for every release by @Munsio https://github.com/symflower/eval-dev-quality/pull/199
    - [x] Docker evaluation runtime by @Munsio https://github.com/symflower/eval-dev-quality/pull/211, https://github.com/symflower/eval-dev-quality/pull/238, https://github.com/symflower/eval-dev-quality/pull/234, https://github.com/symflower/eval-dev-quality/pull/252
    - [x] Parallel execution of containerized evaluations by @Munsio https://github.com/symflower/eval-dev-quality/pull/221
    - [x] Run docker image generation on each push by @Munsio https://github.com/symflower/eval-dev-quality/pull/247
    - [x] fix, Use main revision docker tag by default by @Munsio https://github.com/symflower/eval-dev-quality/pull/249, https://github.com/symflower/eval-dev-quality/issues/242
    - [x] fix, Add commit revision to docker and reports by @Munsio https://github.com/symflower/eval-dev-quality/issues/207, https://github.com/symflower/eval-dev-quality/pull/255
    - [x] fix, IO error when multiple containers use the same result path by @Munsio https://github.com/symflower/eval-dev-quality/issues/219, https://github.com/symflower/eval-dev-quality/issues/273, https://github.com/symflower/eval-dev-quality/pull/274
    - [x] Test docker in GitHub Actions by @Munsio https://github.com/symflower/eval-dev-quality/issues/224, https://github.com/symflower/eval-dev-quality/pull/260
    - [x] fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio https://github.com/symflower/eval-dev-quality/pull/290
    - [x] fix, Pass environment tokens into container by @Munsio https://github.com/symflower/eval-dev-quality/pull/250
    - [x] fix, Use a pinned Java 11 version by @Munsio https://github.com/symflower/eval-dev-quality/pull/279
    - [x] Make paths absolute when copying Docker results because Docker gets confused by paths containing colons by @Munsio https://github.com/symflower/eval-dev-quality/issues/302, https://github.com/symflower/eval-dev-quality/pull/308
  - [x] Kubernetes Support
    - [x] Kubernetes evaluation runtime by @Munsio https://github.com/symflower/eval-dev-quality/pull/231
    - [x] Copy back results from the cluster to the initial host by @Munsio https://github.com/symflower/eval-dev-quality/pull/272
    - [x] fix, Only use valid characters in Kubernetes job names by @Munsio https://github.com/symflower/eval-dev-quality/pull/292
- [x] Timeouts for test execution and symflower test generation by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/167, https://github.com/symflower/eval-dev-quality/issues/185, https://github.com/symflower/eval-dev-quality/issues/232, https://github.com/symflower/eval-dev-quality/issues/, https://github.com/symflower/eval-dev-quality/pull/277, https://github.com/symflower/eval-dev-quality/pull/267, https://github.com/symflower/eval-dev-quality/pull/188
- [x] Clarify prompt that code responses must be in code fences by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/43, https://github.com/symflower/eval-dev-quality/issues/257, https://github.com/symflower/eval-dev-quality/pull/259
- [x] fix, Use backoff for retrying LLMs cause some LLMs need more time to recover by @zimmski https://github.com/symflower/eval-dev-quality/pull/172
[x] Models 🤖
- [x] Pull Ollama models if they are selected for evaluation by @Munsio https://github.com/symflower/eval-dev-quality/issues/283, https://github.com/symflower/eval-dev-quality/pull/284
- [x] Model Selection
  - [x] Exclude certain models (e.g. "openrouter/auto"), because it is just forwarding to a model automatically by @bauersimon https://github.com/symflower/eval-dev-quality/issues/126, https://github.com/symflower/eval-dev-quality/pull/288
  - [x] Exclude the Perplexity online models because they have a "per request" cost https://github.com/symflower/eval-dev-quality/pull/288 (automatically excluded as online models)
  - [x] Additional Models
    - [x] Snowflake
    - [x] DeepSeek V2
    - [x] CodeQwen 7B
    - [x] Gemma 2
    - [x] Cohere Aya
    - [x] Yi 1.5
    - [x] Phi 3
    - [x] Falcon
    - [x] Mistral 7B 0.3
    - [x] Codegemma
- [x] fix, Retry openrouter models query cause it sometimes just errors by @bauersimon https://github.com/symflower/eval-dev-quality/issues/186, https://github.com/symflower/eval-dev-quality/pull/191
- [x] fix, Default to all repositories if none are explicitly selected by @bauersimon https://github.com/symflower/eval-dev-quality/issues/163, https://github.com/symflower/eval-dev-quality/pull/182
- [x] fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/225, https://github.com/symflower/eval-dev-quality/pull/269
- [x] fix, Always use forward slashes in prompts so it's unified by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/152, https://github.com/symflower/eval-dev-quality/pull/268
[x] Reports & Metrics 🗒️
- [x] Logging
  - [x] refactor, Structural logging by @ahumenberger https://github.com/symflower/eval-dev-quality/pull/245
  - [x] Store model responses in separate files for easier lookup by @ahumenberger https://github.com/symflower/eval-dev-quality/issues/181, https://github.com/symflower/eval-dev-quality/pull/278
  - [x] Store coverage objects by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/223
- [x] Write out results right away so we don't lose anything if the evaluation crashes by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/237, https://github.com/symflower/eval-dev-quality/pull/243
- [x] refactor, Abstract the storage of assessments by @ahumenberger https://github.com/symflower/eval-dev-quality/issues/169, https://github.com/symflower/eval-dev-quality/pull/178
- [x] fix, Do not overwrite results but create a separate result directory by @bauersimon https://github.com/symflower/eval-dev-quality/issues/176, https://github.com/symflower/eval-dev-quality/pull/179
- [x] New report subcommand for postprocessing report data
  - [x] report subcommand to compare multiple evaluations into one by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/205, https://github.com/symflower/eval-dev-quality/pull/271
  - [x] Let report command also combine markdown reports by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/258
- [x] Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility https://github.com/symflower/eval-dev-quality/issues/282
  - [x] Store models for the evaluation in JSON configuration report by @bauersimon https://github.com/symflower/eval-dev-quality/pull/285
  - [x] Store repositories for the evaluation in JSON configuration report by @bauersimon https://github.com/symflower/eval-dev-quality/pull/287
  - [x] Load models and repositories that were used from JSON configuration by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/291
- [x] Report maximum of executable files by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/215, https://github.com/symflower/eval-dev-quality/pull/261
- [x] Experiment with human-readable model names and costs to prepare for data visualization
  - [x] Generate the summed model files from the evaluation.csv by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/241
  - [x] Extract human-readable names of models by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/206, https://github.com/symflower/eval-dev-quality/pull/217
  - [x] Extract model costs by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/210, https://github.com/symflower/eval-dev-quality/pull/216
  - [x] Remove summed CSVs, human-readable names to handle them later during visualization by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/256
- [x] Use new Symflower version which reduces error output of the "fix" command by @bauersimon in #323
[x] Operating Systems 🖥️
- [x] More tests for Windows
  - [x] Explicitly test Java test path logic on Windows by @bauersimon https://github.com/symflower/eval-dev-quality/issues/159, https://github.com/symflower/eval-dev-quality/pull/184
  - [x] Extend temporary repository tests to Windows by @bauersimon https://github.com/symflower/eval-dev-quality/issues/141
[x] Tools 🧰
- [x] symflower fix auto-repair of common LLM mistakes
  - [x] Integrate symflower fix into evaluation by @ruiAzevedo19, @bauersimon https://github.com/symflower/eval-dev-quality/issues/213, https://github.com/symflower/eval-dev-quality/pull/229
  - [x] Do not run symflower fix when there is a timeout of the LLM by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/232, https://github.com/symflower/eval-dev-quality/pull/236
  - [x] Update symflower to latest version to benefit from improved Go test package repairs by @bauersimon, @Munsio https://github.com/symflower/eval-dev-quality/pull/294, https://github.com/symflower/eval-dev-quality/pull/303
[x] Tasks 🔢
- [x] Infrastructure for different Task types
  - [x] Introduce the interface for doing "evaluation tasks" so we can easily add them by @ahumenberger https://github.com/symflower/eval-dev-quality/pull/197, https://github.com/symflower/eval-dev-quality/issues/165, https://github.com/symflower/eval-dev-quality/pull/166
  - [x] fix, CSV header missing the task identifier by @bauersimon https://github.com/symflower/eval-dev-quality/issues/187, https://github.com/symflower/eval-dev-quality/pull/190
  - [x] Compile Go and Java so compilation errors can be used for code repair task by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/160, https://github.com/symflower/eval-dev-quality/pull/162
  - [x] refactor, Share logging setup between multiple tasks by @bauersimon https://github.com/symflower/eval-dev-quality/issues/200, https://github.com/symflower/eval-dev-quality/pull/202
  - [x] fix, Missing return statements when checking model capabilities by @bauersimon https://github.com/symflower/eval-dev-quality/pull/239
  - [x] Validate task repositories before evaluation by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/263, https://github.com/symflower/eval-dev-quality/pull/265, https://github.com/symflower/eval-dev-quality/pull/306
- [x] New task types
  - [x] Evaluation task for code repair by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/168, https://github.com/symflower/eval-dev-quality/pull/170, https://github.com/symflower/eval-dev-quality/pull/192
    - [x] fix, Ignore git and Maven repositories when validating code-repair repositories by @ahumenberger, @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/281
    - [x] fix, Correct test value for "variable unknown" code repair task by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/212
    - [x] fix, Score with passing tests in code-repair task cause coverage can be cheated by @bauersimon https://github.com/symflower/eval-dev-quality/issues/320, https://github.com/symflower/eval-dev-quality/pull/321
  - [x] Evaluation task for transpilation (Go->Java and Java->Go) by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/201, https://github.com/symflower/eval-dev-quality/pull/246, https://github.com/symflower/eval-dev-quality/pull/226
    - [x] Early merger for transpilation task by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/264
- [x] fix, Make Java Knapsack easier to solve by reducing Java specifics by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/230, https://github.com/symflower/eval-dev-quality/pull/262
- [x] Internal management of Testdata repositories as temporary Git repositories
  - [x] fix, Create temporary repositories just once by @bauersimon https://github.com/symflower/eval-dev-quality/issues/157, https://github.com/symflower/eval-dev-quality/pull/180
  - [x] fix, Fail tests immediately if outdated tools are installed by @bauersimon https://github.com/symflower/eval-dev-quality/issues/156, https://github.com/symflower/eval-dev-quality/pull/171
- [x] fix, Clarify Java build files to use proper version as required by Maven by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/270, https://github.com/symflower/eval-dev-quality/pull/275

Release version of this roadmap issue:

❓ When should a release happen? Check the README!

[ ] Do a full evaluation with the version
- [x] Exclude certain Openrouter models by default
  - [x] nitro cause they are just faster
  - [x] extended cause longer context windows don't matter for our tasks
  - [x] free and auto cause these are just "aliases" for existing models
- [x] Exclude special-purpose models
  - [x] Vision models
  - [x] Roleplay and creative writing models
  - [x] Classification models
  - [x] Models with internet access (usually denoted by -online suffix)
  - [x] Models with extended context windows (usually denoted by -1234K suffix)
- [x] Always prefer fine tuned (-instruct, -chat) models over a plain base model
[x] Tag version (tag can be moved in case important merges happen afterwards)
[x] For all issues of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple issues) with the users that worked on it
- Fixed bugs should always be sorted into respective relevant categories and not in a generic "Bugs" category!
[x] For all PRs of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple issues) with the users that worked on it
- Fixed bugs should always be sorted into respective relevant categories and not in a generic "Bugs" category!
[x] Search all issues for ...
- [x] Unassigned issues that are closed, and assign them to someone
- [x] Issues without a milestone, and assign them a milestone
- [x] Issues without a label, and assign them at least one label
[x] Write the release notes:
- [x] Use the tasks that are already there for the release note outline
- [x] Add highlighted features based on the done tasks, sort by how many users would use the feature
[x] Do the release
- [x] With the release notes
- [x] Set as latest release
[x] Prepare the next roadmap
- [x] Create a milestone for the next release
- [x] Create a new roadmap issue for the next release
  - [x] Move all open tasks/TODOs from this roadmap issue to the next roadmap issue.
  - [x] Move every comment of this roadmap issue as a TODO to the next roadmap issue. Mark when done with a :rocket: emoji.
[ ] Blog post containing evaluation results, new features and learnings
- [ ] Update README with blog post link and new header image
- [ ] Update repository link with blog post link
- [ ] https://github.com/symflower/eval-dev-quality/discussions
  - [ ] Remove the previous announcements
  - [ ] Add a "Deep dive: $blog-post-title" announcement for the blog post
  - [ ] Add a "v$version: $summary-of-highlights" announcement for the release
[ ] Announce release
[ ] Eat cake 🎂
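
The name-based exclusion rules from the "Do a full evaluation" step above can be sketched as a small filter. This is a rough illustration under assumptions, not the project's implementation: `isExcluded`, the `:nitro`/`:extended`/`:free` variant markers and all model IDs except "openrouter/auto" are made up for this example. Vision, roleplay and classification models are left out because they cannot be recognized from the name alone.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extendedContext matches IDs that end in a context-size suffix such as "-128k".
var extendedContext = regexp.MustCompile(`-\d+k$`)

// excludedVariants holds variant markers that only alias or speed up an existing model.
// Assumption: variants are appended to the model ID as ":<variant>".
var excludedVariants = []string{":nitro", ":extended", ":free"}

// isExcluded reports whether a model ID should be skipped for a full evaluation.
// Hypothetical helper written for this sketch.
func isExcluded(modelID string) bool {
	id := strings.ToLower(modelID)

	// "auto" just forwards to another model.
	if id == "openrouter/auto" {
		return true
	}
	for _, variant := range excludedVariants {
		if strings.HasSuffix(id, variant) {
			return true
		}
	}

	// Models with internet access or extended context windows.
	return strings.HasSuffix(id, "-online") || extendedContext.MatchString(id)
}

func main() {
	// All IDs except "openrouter/auto" are made-up examples.
	for _, id := range []string{
		"openrouter/auto",
		"openrouter/example/model-7b-instruct",
		"openrouter/example/model-7b-instruct:nitro",
		"openrouter/example/model-32k-online",
		"openrouter/example/model-128k",
	} {
		fmt.Printf("%s excluded=%t\n", id, isExcluded(id))
	}
}
```

Keeping the rules in one function makes it easy to review which models were dropped before starting a costly full evaluation.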

Leftover TODOs were moved to https://github.com/symflower/eval-dev-quality/issues/301.

Jun 17 '24 07:06 zimmski