
Roadmap for v0.6.0

Open zimmski opened this issue 1 year ago • 0 comments

Tasks/Goals:

  • [x] Development & Management 🛠️
    • [x] Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger https://github.com/symflower/eval-dev-quality/pull/189
  • [x] Documentation 📚
    • [x] Document roadmaps and release schedule by @bauersimon https://github.com/symflower/eval-dev-quality/pull/196
  • [x] Evaluation ⏱️
    • [x] Isolated Execution https://github.com/symflower/eval-dev-quality/issues/198, https://github.com/symflower/eval-dev-quality/issues/17
      • [x] Docker Support
        • [x] Build Docker image for every release by @Munsio https://github.com/symflower/eval-dev-quality/pull/199
        • [x] Docker evaluation runtime by @Munsio https://github.com/symflower/eval-dev-quality/pull/211, https://github.com/symflower/eval-dev-quality/pull/238, https://github.com/symflower/eval-dev-quality/pull/234, https://github.com/symflower/eval-dev-quality/pull/252
        • [x] Parallel execution of containerized evaluations by @Munsio https://github.com/symflower/eval-dev-quality/pull/221
        • [x] Run docker image generation on each push by @Munsio https://github.com/symflower/eval-dev-quality/pull/247
        • [x] fix, Use main revision docker tag by default by @Munsio https://github.com/symflower/eval-dev-quality/pull/249, https://github.com/symflower/eval-dev-quality/issues/242
        • [x] fix, Add commit revision to docker and reports by @Munsio https://github.com/symflower/eval-dev-quality/issues/207, https://github.com/symflower/eval-dev-quality/pull/255
        • [x] fix, IO error when multiple Containers use the same result path by @Munsio https://github.com/symflower/eval-dev-quality/issues/219, https://github.com/symflower/eval-dev-quality/issues/273, https://github.com/symflower/eval-dev-quality/pull/274
        • [x] Test docker in GitHub Actions by @Munsio https://github.com/symflower/eval-dev-quality/issues/224, https://github.com/symflower/eval-dev-quality/pull/260
        • [x] fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio https://github.com/symflower/eval-dev-quality/pull/290
        • [x] fix, Pass environment tokens into container by @Munsio https://github.com/symflower/eval-dev-quality/pull/250
        • [x] fix, Use a pinned Java 11 version by @Munsio https://github.com/symflower/eval-dev-quality/pull/279
        • [x] Make paths absolute when copying docker results cause docker gets confused with paths containing colons by @Munsio https://github.com/symflower/eval-dev-quality/issues/302, https://github.com/symflower/eval-dev-quality/pull/308
      • [x] Kubernetes Support
        • [x] Kubernetes evaluation runtime by @Munsio https://github.com/symflower/eval-dev-quality/pull/231
        • [x] Copy back results from the cluster to the initial host by @Munsio https://github.com/symflower/eval-dev-quality/pull/272
        • [x] fix, Only use valid characters in Kubernetes job names by @Munsio https://github.com/symflower/eval-dev-quality/pull/292
    • [x] Timeouts for test execution and symflower test generation by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/167, https://github.com/symflower/eval-dev-quality/issues/185, https://github.com/symflower/eval-dev-quality/issues/232, https://github.com/symflower/eval-dev-quality/issues/, https://github.com/symflower/eval-dev-quality/pull/277, https://github.com/symflower/eval-dev-quality/pull/267, https://github.com/symflower/eval-dev-quality/pull/188
    • [x] Clarify prompt that code responses must be in code fences by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/43, https://github.com/symflower/eval-dev-quality/issues/257, https://github.com/symflower/eval-dev-quality/pull/259
    • [x] fix, Use backoff for retrying LLMs cause some LLMs need more time to recover by @zimmski https://github.com/symflower/eval-dev-quality/pull/172
  • [x] Models 🤖
    • [x] Pull Ollama models if they are selected for evaluation by @Munsio https://github.com/symflower/eval-dev-quality/issues/283, https://github.com/symflower/eval-dev-quality/pull/284
    • [x] Model Selection
      • [x] Exclude certain models (e.g. "openrouter/auto"), because it is just forwarding to a model automatically by @bauersimon https://github.com/symflower/eval-dev-quality/issues/126, https://github.com/symflower/eval-dev-quality/pull/288
      • [x] Exclude the Perplexity online models because they have a "per request" cost https://github.com/symflower/eval-dev-quality/pull/288 (automatically excluded as online models)
      • [x] Additional Models
        • [x] Snowflake
        • [x] DeepSeek V2
        • [x] CodeQwen 7B
        • [x] Gemma 2
        • [x] Cohere Aya
        • [x] Yi 1.5
        • [x] Phi 3
        • [x] Falcon
        • [x] Mistral 7B 0.3
        • [x] Codegemma
    • [x] fix, Retry openrouter models query cause it sometimes just errors by @bauersimon https://github.com/symflower/eval-dev-quality/issues/186, https://github.com/symflower/eval-dev-quality/pull/191
    • [x] fix, Default to all repositories if none are explicitly selected by @bauersimon https://github.com/symflower/eval-dev-quality/issues/163, https://github.com/symflower/eval-dev-quality/pull/182
    • [x] fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/225, https://github.com/symflower/eval-dev-quality/pull/269
    • [x] fix, Always use forward slashes in prompts so paths are unified by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/152, https://github.com/symflower/eval-dev-quality/pull/268
  • [x] Reports & Metrics 🗒️
    • [x] Logging
      • [x] refactor, Structural logging by @ahumenberger https://github.com/symflower/eval-dev-quality/pull/245
      • [x] Store model responses in separate files for easier lookup by @ahumenberger https://github.com/symflower/eval-dev-quality/issues/181, https://github.com/symflower/eval-dev-quality/pull/278
      • [x] Store coverage objects by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/223
    • [x] Write out results right away so we don't lose anything if the evaluation crashes by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/237, https://github.com/symflower/eval-dev-quality/pull/243
    • [x] refactor, Abstract the storage of assessments by @ahumenberger https://github.com/symflower/eval-dev-quality/issues/169, https://github.com/symflower/eval-dev-quality/pull/178
    • [x] fix, Do not overwrite results but create a separate result directory by @bauersimon https://github.com/symflower/eval-dev-quality/issues/176, https://github.com/symflower/eval-dev-quality/pull/179
    • [x] New report subcommand for postprocessing report data
      • [x] report subcommand to compare multiple evaluations into one by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/205, https://github.com/symflower/eval-dev-quality/pull/271
      • [x] Let report command also combine markdown reports by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/258
    • [x] Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility https://github.com/symflower/eval-dev-quality/issues/282
      • [x] Store models for the evaluation in JSON configuration report by @bauersimon https://github.com/symflower/eval-dev-quality/pull/285
      • [x] Store repositories for the evaluation in JSON configuration report by @bauersimon https://github.com/symflower/eval-dev-quality/pull/287
      • [x] Load models and repositories that were used from JSON configuration by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/291
    • [x] Report maximum of executable files by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/215, https://github.com/symflower/eval-dev-quality/pull/261
    • [x] Experiment with human-readable model names and costs to prepare for data visualization
      • [x] Generate the summed model files from the evaluation.csv by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/241
      • [x] Extract human-readable names of models by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/206, https://github.com/symflower/eval-dev-quality/pull/217
      • [x] Extract model costs by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/210, https://github.com/symflower/eval-dev-quality/pull/216
      • [x] Remove summed CSVs, human-readable names to handle them later during visualization by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/256
    • [x] Use new Symflower version which reduces error output of the "fix" command by @bauersimon in #323
  • [x] Operating Systems 🖥️
    • [x] More tests for Windows
      • [x] Explicitly test Java test path logic on Windows by @bauersimon https://github.com/symflower/eval-dev-quality/issues/159, https://github.com/symflower/eval-dev-quality/pull/184
      • [x] Extend temporary repository tests to Windows by @bauersimon https://github.com/symflower/eval-dev-quality/issues/141
  • [x] Tools 🧰
    • [x] symflower fix auto-repair of common LLM mistakes
      • [x] Integrate symflower fix into evaluation by @ruiAzevedo19, @bauersimon https://github.com/symflower/eval-dev-quality/issues/213, https://github.com/symflower/eval-dev-quality/pull/229
      • [x] Do not run symflower fix when there is a timeout of the LLM by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/232, https://github.com/symflower/eval-dev-quality/pull/236
      • [x] Update symflower to latest version to benefit from improved Go test package repairs by @bauersimon, @Munsio https://github.com/symflower/eval-dev-quality/pull/294, https://github.com/symflower/eval-dev-quality/pull/303
  • [x] Tasks 🔢
    • [x] Infrastructure for different Task types
      • [x] Introduce the interface for doing "evaluation tasks" so we can easily add them by @ahumenberger https://github.com/symflower/eval-dev-quality/pull/197, https://github.com/symflower/eval-dev-quality/issues/165, https://github.com/symflower/eval-dev-quality/pull/166
      • [x] fix, CSV header missing the task identifier by @bauersimon https://github.com/symflower/eval-dev-quality/issues/187, https://github.com/symflower/eval-dev-quality/pull/190
      • [x] Compile Go and Java so compilation errors can be used for code repair task by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/160, https://github.com/symflower/eval-dev-quality/pull/162
      • [x] refactor, Share logging setup between multiple tasks by @bauersimon https://github.com/symflower/eval-dev-quality/issues/200, https://github.com/symflower/eval-dev-quality/pull/202
      • [x] fix, Missing return statements when checking model capabilities by @bauersimon https://github.com/symflower/eval-dev-quality/pull/239
      • [x] Validate task repositories before evaluation by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/263, https://github.com/symflower/eval-dev-quality/pull/265, https://github.com/symflower/eval-dev-quality/pull/306
    • [x] New task types
      • [x] Evaluation task for code repair by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/168, https://github.com/symflower/eval-dev-quality/pull/170, https://github.com/symflower/eval-dev-quality/pull/192
        • [x] fix, Ignore git and Maven repositories when validating code-repair repositories by @ahumenberger, @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/281
        • [x] fix, Correct test value for "variable unknown" code repair task by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/212
        • [x] fix, Score with passing tests in code-repair task cause coverage can be cheated by @bauersimon https://github.com/symflower/eval-dev-quality/issues/320, https://github.com/symflower/eval-dev-quality/pull/321
      • [x] Evaluation task for transpilation (Go->Java and Java->Go) by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/201, https://github.com/symflower/eval-dev-quality/pull/246, https://github.com/symflower/eval-dev-quality/pull/226
        • [x] Early merger for transpilation task by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/pull/264
    • [x] fix, Make Java Knapsack easier to solve by reducing Java specifics by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/230, https://github.com/symflower/eval-dev-quality/pull/262
    • [x] Internal management of Testdata repositories as temporary Git repositories
      • [x] fix, Create temporary repositories just once by @bauersimon https://github.com/symflower/eval-dev-quality/issues/157, https://github.com/symflower/eval-dev-quality/pull/180
      • [x] fix, Fail tests immediately if outdated tools are installed by @bauersimon https://github.com/symflower/eval-dev-quality/issues/156, https://github.com/symflower/eval-dev-quality/pull/171
    • [x] fix, Clarify Java build files to use proper version as required by Maven by @ruiAzevedo19 https://github.com/symflower/eval-dev-quality/issues/270, https://github.com/symflower/eval-dev-quality/pull/275

Release version of this roadmap issue:

❓ When should a release happen? Check the README!

  • [ ] Do a full evaluation with the version
    • [x] Exclude certain Openrouter models by default
      • [x] nitro cause they are just faster
      • [x] extended cause longer context windows don't matter for our tasks
      • [x] free and auto cause these are just "aliases" for existing models
    • [x] Exclude special-purpose models
      • [x] Vision models
      • [x] Roleplay and creative writing models
      • [x] Classification models
      • [x] Models with internet access (usually denoted by -online suffix)
      • [x] Models with extended context windows (usually denoted by -1234K suffix)
    • [x] Always prefer fine-tuned (-instruct, -chat) models over a plain base model
  • [x] Tag version (tag can be moved in case important merges happen afterwards)
  • [x] For all issues of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple issues) with the users that worked on it
    • Fixed bugs should always be sorted into respective relevant categories and not in a generic "Bugs" category!
  • [x] For all PRs of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple issues) with the users that worked on it
    • Fixed bugs should always be sorted into respective relevant categories and not in a generic "Bugs" category!
  • [x] Search all issues for ...
    • [x] Unassigned issues that are closed, and assign them to someone
    • [x] Issues without a milestone, and assign them a milestone
    • [x] Issues without a label, and assign them at least one label
  • [x] Write the release notes:
    • [x] Use the tasks that are already there for the release note outline
    • [x] Add highlighted features based on the done tasks, sort by how many users would use the feature
  • [x] Do the release
    • [x] With the release notes
    • [x] Set as latest release
  • [x] Prepare the next roadmap
    • [x] Create a milestone for the next release
    • [x] Create a new roadmap issue for the next release
      • [x] Move all open tasks/TODOs from this roadmap issue to the next roadmap issue.
      • [x] Move every comment of this roadmap issue as a TODO to the next roadmap issue. Mark when done with a :rocket: emoji.
  • [ ] Blog post containing evaluation results, new features and learnings
    • [ ] Update README with blog post link and new header image
    • [ ] Update repository link with blog post link
    • [ ] https://github.com/symflower/eval-dev-quality/discussions
      • [ ] Remove the previous announcements
      • [ ] Add a "Deep dive: $blog-post-title" announcement for the blog post
      • [ ] Add a "v$version: $summary-of-highlights" announcement for the release
  • [ ] Announce release
  • [ ] Eat cake 🎂
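
The model exclusion rules in the checklist above (nitro, extended, free and auto variants, plus the -online and -1234K suffixes) can be expressed as a small filter. The sketch below is only an illustration: the function `isExcluded`, the `:` separator for variants, and the example model IDs are assumptions, not the project's actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// excludedVariants holds the variant markers named in the checklist.
// Assumption: variants are appended to the model ID after a colon.
var excludedVariants = []string{":nitro", ":extended", ":free"}

// extendedContext matches a "-<digits>K" suffix such as "-128K" (case-insensitive).
var extendedContext = regexp.MustCompile(`(?i)-\d+k$`)

// isExcluded is a hypothetical helper that reports whether a model ID
// should be skipped by default according to the rules in the checklist.
func isExcluded(modelID string) bool {
	if modelID == "openrouter/auto" {
		return true
	}
	for _, variant := range excludedVariants {
		if strings.HasSuffix(modelID, variant) {
			return true
		}
	}

	return strings.HasSuffix(modelID, "-online") || extendedContext.MatchString(modelID)
}

func main() {
	// Made-up model IDs, used only to exercise the rules.
	for _, id := range []string{
		"openrouter/auto",
		"openrouter/example/model-7b-instruct",
		"openrouter/example/model-7b-instruct:nitro",
		"openrouter/example/model-online",
		"openrouter/example/model-128K",
	} {
		fmt.Printf("%-45s excluded=%v\n", id, isExcluded(id))
	}
}
```

Rules that depend on a model's purpose (vision, roleplay, classification) cannot be derived from a suffix and would need a curated list instead.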

Leftover TODOs were moved to https://github.com/symflower/eval-dev-quality/issues/301.

zimmski, Jun 17 '24 07:06