Roadmap for v0.5.0
v0.5.0 is mainly meant to introduce more variety. The main goals are:
- Introduce more cases with logic, to make sure that "better models" have a bigger difference in score.
- Introduce more providers so we can test models that have been requested, and react faster to new releases.
Tasks:
- [x] Metrics
- [x] Measure processing time of queries #106 #105
- [x] Automate multiple runs for more deterministic results #109 #108
- [x] Empty responses should be marked as error responses, to indicate that they are on the same level as errors #97
- [x] Support execution of more models
- [x] Ollama support #91 #95 #96 #117 #118 #115 #27
- [x] Reporting
- [x] Additional CSVs to sum up overall results, and language individual results #94 #83
- [x] fix, Y axis ticks should be readable #73
- [x] fix, Deterministic order of rows in CSV exporting #99 #98
- [x] Multi OS support of eval
- [x] Support MacOS #102
- [x] Support Windows #103 #101 #104
- [x] Tools
- [x] Introduce a unique ID for addressing tools OS-independently #122
- [ ] Extend `symflower test` with a deeper execution coverage export
- [ ] Go
- [x] Extract to file
- [ ] Cover lines of tests that have exceptions
- [ ] Java
- [x] Extract to file
- [ ] Cover lines of tests that have exceptions
- [ ] Make coverage metric fair
- [ ] Go
- [ ] Tasks and cases
- [ ] Introduce more cases with logic in "light" repository
- [ ] Go
- [ ] Java
- [ ] Release
- [ ] Do a full evaluation with the new version
- [ ] Tag version
- [ ] Blog post
- [ ] Adapt README
- [ ] Announce and eat cake
TODO sort and sort out
- [ ] TODO add https://github.com/symflower/eval-dev-quality/milestone/2
- [ ] TODO https://github.com/symflower/eval-dev-quality/pulls?q=is%3Aopen+is%3Apr+milestone%3Av0.5.0
- [ ] For non-selective evaluations exclude certain models, e.g. "openrouter/auto" needs to go because it is not a real model, it just forwards to a model automatically https://github.com/symflower/eval-dev-quality/issues/126
- [ ] Allow arbitrary URLs for API provider
- [ ] AST differ https://github.com/symflower/eval-dev-quality/issues/80
- [ ] Non-benchmark metrics #82
- [x] Bug Empty responses should not be tested but should fail https://github.com/symflower/eval-dev-quality/blob/3e7dc8c5beab65f5958a458a593823ba5c25698e/docs/reports/v0.4.0/openrouter_databricks_dbrx-instruct/java/java/plain.log
- [ ] https://github.com/symflower/eval-dev-quality/issues/81
- [ ] Bug that hinders more repositories: when an LLM is evaluated on "plain" and Go fails but Java works, then it is blocked from doing ALL repositories. In that case it should only be blocked for Go and not Java.
- [ ] Take a look at https://twitter.com/SMT_Solvers/status/1783540994304066006
- [ ] Clean up and extend README
- [x] Less fuzz and fluff (see Thomas' feedback)
- [x] Bring in the current blog post, its information, especially the blog post image to showcase the evaluation
- [ ] Better examples for contributions
- [ ] Overhaul the explanation of "why" we need this evaluation, i.e. why it is good to evaluate even an empty function that does nothing.
- [x] Readme extension:
The nice thing about generating tests is that it is easy to automatically check whether the result is correct: the tests need to compile and provide 100% coverage. But one can only write such tests if one understands the source, so implicitly we are evaluating the language understanding of the LLM. (See the Go sketch after this list.)
- [ ] Automatically updated leader board for this repository: #26
- [ ] Take a look at current leaderboards and evals to know what could be interesting
Current popular code leaderboards are [LiveCodeBench](https://huggingface.co/spaces/livecodebench/leaderboard), the [BigCode models leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard), [CyberSecEval](https://huggingface.co/spaces/facebook/CyberSecEval) and [CanAICode](https://huggingface.co/spaces/mike-ravkine/can-ai-code-results)
- [ ] Report generation and logs
- [ ] Remove absolute paths completely e.g. in stack traces too.
- [ ] Metrics
- [ ] Automatically interpret "Extra code" #44
- [ ] Java
- [ ] Log Maven commands because they can be faulty (remember that the "surefire" plugin needs a fixed version because GitHub's is too old). That makes it easier to debug.
- [ ] Let the Java test case for `No test files` actually identify and report the error that there are no test files (needs to be implemented in `symflower test`)
- [ ] LLM
- [ ] Log request and response in their own files, so both can be used 1:1 (character for character) directly for debugging them
- [ ] Improve LLM prompt
- [ ] Add an app name to the requests so people know we are the eval. https://openrouter.ai/docs#quick-start shows that other OpenAI API client packages implement custom headers, but the Go package we are using does not. So do a PR to contribute.
- [ ] Add markers for system and user e.g. https://github.com/symflower/eval-dev-quality/pull/27#issuecomment-2058992479
- [ ] Think about a standardized way of printing outputs, e.g. JSON https://twitter.com/WuMinghao_nlp/status/1789094583290507626
- [ ] Prepare for files with more paths
- [ ] Do not use percentages for coverage but absolute metrics, and then show percentages if needed https://github.com/symflower/eval-symflower-codegen-testing/pull/8#discussion_r1549404830
- [ ] Prepare language and evaluation logic for multiple files:
- [ ] Do clean up of generated files
- [ ] Use `symflower symbols` to receive files
- [ ] Sandboxed execution #17 e.g. with Docker as its first implementation
- [ ] Do an evaluation with different temperatures
- [ ] Automatic tool installation with fixed version
- [ ] Go
- [ ] Java
- [ ] Evaluation tasks
- [ ] Introduce the interface for doing "evaluation tasks" so we can easily add them
- [ ] Add evaluation task for "querying the relative test file path of a relative implementation file path" e.g. "What is the test relative file path for some/implementation/file.go" ... it is "some/implementation/file_test.go" for most cases.
- [ ] Add evaluation task for transpilation Go->Java and Java->Go
- [ ] Scoring, Categorization, Bar Charts split by language.
- [ ] Check the determinism of models, e.g. execute each plain repository X times, and then check if the results are stable.
- [ ] Save the descriptions of the models as well: https://openrouter.ai/api/v1/models The reason is that these can change over time, and we need to know after a while what they were. E.g. right now I would like to know whether mistral-7b-instruct in the last evaluation was v0.1 or not.
- [ ] Order models by open-weight, allows commercial use, closed, and by price(!) and size: e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 is great because it is open-weight and Apache 2.0, so commercial use is allowed. It should be rated better than GPT-4.
- [ ] Write down a playbook for evaluations, e.g. one thing that should happen is that we run the benchmark 5 times and then sum up the points, but the runs should have at least a one-hour break in between to not run into cached responses.
- [ ] Bar charts should have their value on the bar. The axis values do not work that well.
- [ ] Pick an example or several examples per category: the goal is to find interesting results automatically, because it will get harder and harder to go through results manually.
- [ ] Do test file paths through `symflower symbols`
- [ ] Task for models
- [ ] Query the REAL costs of all the testing of a model: the reason this is interesting is that some models have HUGE outputs, and since more output means more costs, this should be reflected in the score.
- [ ] Think about excluding the "Perplexity" models because they have a "per request" cost, and they are the only ones that do that.
- [ ] Charts to showcase data
- [ ] Total-score vs. cost scatter plot. The result is an upper-left-corner sweet spot: cheap and good results.
- [ ] Pie chart of the whole evaluation's costs: for each LLM show how much it costs. The result is to see which LLMs cost the most when running the eval.
- [ ] Reporting and documentation on writing deep-dives
- [ ] What are results that align with expectations? What are results against expectations? E.g. are there small LLMs that are better than big ones?
- [ ] Are there big LLMs that totally fail?
- [ ] Are there small LLMs that are surprisingly good?
- [ ] What about LLMs where the community doesn't know that much yet, e.g. Snowflake, DBRX, ...?
- [ ] Snowflake against Databricks would be a nice comparison since they align company-wise and are new
- [ ] Include more models (the main problem is that multiple models come out every day. We should not wait for a "new version" of the eval, we should test these models right away and compare them. Big problem: how do we promote findings?)
- [ ] Snowflake
- [ ] DeepSeek for Code https://t.co/jpGY5w0e4Y see https://twitter.com/ak_kim0/status/1786141140376285685
- [ ] Move towards generated cases so models cannot incorporate fixed cases into their training to always have a 100% score
- [ ] Think about adding more training data generation features: this will also help with dynamic cases
- [ ] Heard that Snowflake Arctic is very open about how they gathered training data... so we can see what LLM creators think of and want from training data
- [ ] Code repair
- [ ] Own task category
- [ ] 0-shot, 1-shot, ...
- [ ] With LLM repair
- [ ] With tool repair
- [ ] Rescore existing models / evals with fixes, e.g. when we build a better code repair tool, the LLM answers did not change, so we should rescore a whole evaluation result right away with the new version of the tool.
- [ ] Think about a commercial effort for the eval, so we can balance some of the costs that go into maintaining this eval
- [ ] Take a look at all OpenRouter's features of the API e.g. https://openrouter.ai/docs#parameters
- [ ] https://www.reddit.com/r/LocalLLaMA/comments/1cihrdt/comment/l29o97q/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button mentions that disabling repetition_penalty helps with performance and gives better results for coding.
- [ ] Requested new models for the eval
- [ ] Snowflake
- [ ] CodeQwen 7B
- [ ] DeepSeek
- [ ] Nous-Hermes-2-SOLAR-10.7B; also, the "Tree of Thoughts" approach might be interesting as a task https://www.reddit.com/r/LocalLLaMA/comments/1cihrdt/comment/l2d4im0/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
- [ ] Gemma 2
- [ ] https://x.com/osanseviero/status/1793644453007155451
- [ ] https://x.com/osanseviero/status/1793930015047880959
- [ ] Blog post about the different suffixes of models, e.g. "chat" and "instruct", and eval them somehow. Idea from https://www.reddit.com/r/LocalLLaMA/comments/1bz5oyx/comment/kyrfap4/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
- [ ] Categorize by parameters/experts https://www.reddit.com/r/LocalLLaMA/comments/1cdivc8/comment/l1davhv/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
- [ ] Maybe use https://huggingface.co/inference-endpoints/dedicated
- [ ] Blog post about training a small LLM directly on HumanEval
- [ ] Ensure that non-critical CLI input validation (such as unavailable models) does not panic
- [ ] Look at what Paul Gauthier is doing with the benchmarking of Aider (and at Aider itself) https://twitter.com/paulgauthier/status/1787827703158386921?s=46; it seems like a perfect match for what we want to do as tasks
- [ ] Look at MLX, which might help us with execution https://twitter.com/loganthorneloe/status/1787845883519775120
- [ ] Take a look at xLSTM https://twitter.com/HochreiterSepp/status/1788072466675335185
- [ ] Compare input/output/request/... costs https://twitter.com/oleghuman/status/1786296672785420744
- [ ] Take a look at eval https://twitter.com/JiaweiLiu_/status/1783959954321252697
- [ ] Take a look at evaluation framework https://twitter.com/hamelhusain/status/1788936691576975382?s=61
- [ ] Benchmark quantized models, because they need less memory
- [ ] https://twitter.com/HaihaoShen/status/1789178048308543688
- [ ] Ollama support
- [ ] Install and test Ollama on MacOS
- [ ] Install and test Ollama on Windows
- [ ] Even powerful models such as GPT-4 and Llama 3 might return EOF or an error (https://github.com/symflower/eval-dev-quality/tree/105%2B108/evaluation-2024-05-14-09%3A18%3A41), so give them a second chance or something: https://github.com/symflower/eval-dev-quality/issues/123
- [ ] Distinguish between latency (time to first token) and throughput (tokens generated per second); see the sketch after this list
- [ ] Allow to forward CLI commands to be evaluated: https://github.com/symflower/eval-dev-quality/pull/27#issuecomment-2077677950
- [ ] Refactor `Model` and `Provider` to be in the same package https://github.com/symflower/eval-dev-quality/pull/121#discussion_r1603371915
- [ ] Blog post about the "non-determinism of LLMs" and how we can make them at least more stable; https://community.openai.com/t/a-question-on-determinism/8185 is a good starting point.
- [ ] Dig through https://arxiv.org/pdf/2405.14782 thanks to https://x.com/clefourrier/status/1793913394871062970
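To illustrate the README point above about why generated tests are easy to check automatically: a generated test is accepted when it compiles and reaches 100% coverage of the given implementation. A minimal sketch in Go; the file and function names are made up for illustration and are not cases from the repository:

```go
// add.go — a hypothetical implementation file that is handed to the LLM.
package example

// Add returns the sum of both arguments.
func Add(a int, b int) int {
	return a + b
}
```

```go
// add_test.go — what a "correct" generated test looks like: it compiles and
// executes every statement of the implementation, i.e. 100% coverage.
package example

import "testing"

func TestAdd(t *testing.T) {
	if actual := Add(1, 2); actual != 3 {
		t.Errorf("Add(1, 2) = %d, expected 3", actual)
	}
}
```

Running `go test -cover` on such a pair answers both questions at once: does the generated test compile, and does it reach 100% coverage?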
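Regarding the latency/throughput item above, a small sketch of how both metrics could be derived from the timestamps of a streamed response. The `StreamTiming` type and its methods are hypothetical, not existing evaluation code:

```go
package metrics

import "time"

// StreamTiming records when a request was sent and when each streamed token arrived.
type StreamTiming struct {
	RequestSent time.Time
	TokenTimes  []time.Time
}

// Latency returns the time to first token.
func (s StreamTiming) Latency() time.Duration {
	if len(s.TokenTimes) == 0 {
		return 0
	}
	return s.TokenTimes[0].Sub(s.RequestSent)
}

// Throughput returns the tokens generated per second after the first token arrived.
func (s StreamTiming) Throughput() float64 {
	if len(s.TokenTimes) < 2 {
		return 0
	}
	seconds := s.TokenTimes[len(s.TokenTimes)-1].Sub(s.TokenTimes[0]).Seconds()
	if seconds <= 0 {
		return 0
	}
	return float64(len(s.TokenTimes)-1) / seconds
}
```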
CC @bauersimon
Add an app name to the requests so people know we are the eval. https://openrouter.ai/docs#quick-start shows that other OpenAI API client packages implement custom headers, but the Go package we are using does not. So do a PR to contribute.
- "Features not provided by the OpenAI API will not be accepted."
- Issue: Custom Headers: "customize it with `HTTPClient`"
- PR: Configurable `RequestBuilder` for custom headers
- README example: custom proxy

Seems like a PR does not make sense.
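Since an upstream PR seems pointless, the attribution could be injected on our side through the `HTTPClient` hook mentioned above by wrapping the transport. A minimal sketch, assuming the `HTTP-Referer`/`X-Title` attribution headers from the OpenRouter quick-start; how the client is handed to the OpenAI Go package is left out:

```go
package provider

import "net/http"

// headerRoundTripper adds static headers to every outgoing request.
type headerRoundTripper struct {
	base    http.RoundTripper
	headers map[string]string
}

func (h headerRoundTripper) RoundTrip(request *http.Request) (*http.Response, error) {
	request = request.Clone(request.Context()) // Do not mutate the caller's request.
	for key, value := range h.headers {
		request.Header.Set(key, value)
	}
	return h.base.RoundTrip(request)
}

// attributionHTTPClient returns an HTTP client that identifies the eval to OpenRouter.
func attributionHTTPClient() *http.Client {
	return &http.Client{
		Transport: headerRoundTripper{
			base: http.DefaultTransport,
			headers: map[string]string{
				"HTTP-Referer": "https://github.com/symflower/eval-dev-quality",
				"X-Title":      "eval-dev-quality",
			},
		},
	}
}
```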
Blogpost idea: misleading comments... how much does it take to confuse the most powerful AI? (credit to @ahumenberger)
Maybe not only comments. What about obfuscated code, e.g. function and variable names that are just random strings?
Take a look at https://x.com/dottxtai/status/1798443290913853770
Looking through the logs... Java consistently has more code than Go for the same tasks, which yields more coverage. So a model that solves all Java tasks but no Go tasks is automatically ranked higher than one doing the opposite.
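One possible way to address this (only an idea, not existing evaluation logic): weight coverage per task instead of counting raw covered statements, so a language that needs more code cannot earn more points for the same task. Sketch:

```go
package metrics

// TaskCoverage holds the coverage result of one evaluation task.
type TaskCoverage struct {
	CoveredStatements   uint
	CoverableStatements uint
}

// normalizedCoverageScore gives every task the same weight between 0.0 and 1.0,
// regardless of how many statements a language needs to solve it. A verbose
// language such as Java can then no longer outscore Go on identical tasks.
func normalizedCoverageScore(tasks []TaskCoverage) float64 {
	if len(tasks) == 0 {
		return 0
	}
	var score float64
	for _, task := range tasks {
		if task.CoverableStatements == 0 {
			continue
		}
		score += float64(task.CoveredStatements) / float64(task.CoverableStatements)
	}
	return score / float64(len(tasks))
}
```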
Closed with #297