
Features, bugs, research, ... that we are not actively working on

zimmski opened this issue 11 months ago

TODO: sort and sort out:

  • [ ] Models
    • [ ] Better prompting with templates: Setting a mandatory document start #29
    • [ ] Retry with feedback and retry without feedback. #30
    • [ ] Implement "chain of thought" tasks #31
    • [ ] Deal with dependencies requested by LLMs #174
    • [ ] Evaluation run for all "good open weight models" with all available quantizations and different GPUs #209
    • [ ] Automatically exclude the OpenRouter `auto` and `flavor-of-the-week` models in the provider
    • [ ] Rethink retry logic for LLM Providers #305
    • [ ] Openrouter Provider preferences #286
    • [ ] Include more models (the main problem is that multiple models come out every day; we should not wait for a "new version" of the eval but test these models right away and compare them; big problem: how do we promote findings?)
      • [ ] Nous-Hermes-2-SOLAR-10.7B; also, a "Tree of Thoughts" approach might be interesting as a task
    • [ ] Maybe use https://huggingface.co/inference-endpoints/dedicated
  • [ ] Metrics & Reporting
    • [ ] Evaluation folder with date cannot be created on windows #151
    • [ ] Extend the `report` command such that it takes result CSVs and automatically
      • [ ] does the summing and aggregation (if we still want that to be a separate step)
      • [ ] finds the maximum scores for that evaluation run
      • [ ] once we have the leaderboard, we basically want to configure the repository such that we just add a model to some config somewhere and GitHub Actions runs automatically and benchmarks this model
      • [ ] or, in a similar fashion, we just do a new release and GitHub Actions runs automatically and benchmarks everything for the new version
    • [ ] Automatically updated leaderboard for this repository: #26
      • [ ] Take a look at current leaderboards and evals to know what could be interesting. Current popular code leaderboards are [LiveCodeBench](https://huggingface.co/spaces/livecodebench/leaderboard), the [BigCode models leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard), [CyberSecEval](https://huggingface.co/spaces/facebook/CyberSecEval) and [CanAICode](https://huggingface.co/spaces/mike-ravkine/can-ai-code-results).
    • [ ] Scoring
      • [ ] Infer if a model produced "too much" code #44
      • [ ] Introduce an AST-differ that also gives metrics #80
      • [ ] Add linters where each error is a metric #81
      • [ ] Include metrics about the models for comparing models #82
      • [ ] Coverage for Java is tracked for lines, while Go is tracked for ranges #193
      • [ ] Weight "executed code" more prominently #233
      • [ ] AST differ https://github.com/symflower/eval-dev-quality/issues/80
      • [ ] Linters https://github.com/symflower/eval-dev-quality/issues/81
      • [ ] Automatically infer "Extra code" https://github.com/symflower/eval-dev-quality/issues/44
      • [ ] Figure out the "perfect" coverage score so we can display percentage of coverage reached
      • [ ] Make coverage metric fair
        • "Looking through logs... Java consistently has more code than Go for the same tasks, which yields more coverage. So a model that solves all Java tasks but no Go is automatically higher ranked than the opposite." -> only Symflower coverage will make this fair
      • [ ] Distinguish between latency (time-to-first-token) and throughput (tokens generated per second); see the sketch after this list
      • [ ] Failing tests should receive a score penalty
    • [ ] Metrics
      • [ ] Track query tokens and save them to CSV #347
      • [ ] Non-benchmark metrics (cost, weights open, ...) #82
        • [ ] Save the descriptions of the models as well: https://openrouter.ai/api/v1/models The reason is that these can change over time, and we need to know after a while what they were, e.g. right now I would like to know if mistral-7b-instruct for the last evaluation was v0.1 or not (see the snapshot sketch after this list)
      • [ ] Query REAL costs of all the testing of a model: the reason this is interesting is that some models have HUGE outputs, and since more output means more costs, this should be addressed in the score.
    • [ ] Reporting
      • [ ] Do an up-to-date leaderboard/dashboard for current models current evaluation #26
      • [ ] Bar charts should have their value on the bar. The axis values do not work that well.
      • [ ] Pick one or several examples per category: the goal is to find interesting results automatically, because it will get harder and harder to go through results manually.
      • [ ] Total-scores vs. costs scatter plot. The result is an upper-left-corner sweet spot: cheap and good results.
      • [ ] Scoring, Categorization, Bar Charts split by language.
      • [ ] Pie chart of the whole evaluation's costs: for each LLM show how much it costs. The result is to see which LLMs cost the most to run the eval.
      • [ ] Deep-dive content
        • [ ] What are results that align with expectations? What are results against expectations? E.g. are there small LLMs that are better than big ones?
        • [ ] Are there big LLMs that totally fail?
        • [ ] Are there small LLMs that are surprisingly good?
        • [ ] What about LLMs where the community doesn't know that much yet, e.g. Snowflake, DBRX, ...?
      • [ ] Order models by open-weight, commercial use allowed, closed, and price(!) and size: e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 is great because it is open-weight and Apache-2.0, so commercial use is allowed. It should be rated better than GPT-4.
      • [ ] Categorize by parameters/experts https://www.reddit.com/r/LocalLLaMA/comments/1cdivc8/comment/l1davhv/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
      • [ ] Compare input/output/request/... costs https://twitter.com/oleghuman/status/1786296672785420744
  • [ ] Logging
    • [ ] Remove absolute paths completely, e.g. in stack traces too.
    • [ ] Log request and response in their own files, so both can be used 1:1 (character for character) directly for debugging them: https://github.com/symflower/eval-dev-quality/issues/204
  • [ ] Tooling & Installation
    • [ ] CI and tools
      • [ ] InstallToolsPath is not used for test execution (make test) #93
      • [ ] Test for pulling Ollama model is flaky #135
      • [ ] Flaky CI because of corrupted Z3 installation #107
      • [ ] Follow up: Ollama Support #100
      • [ ] ollama_llama_server and other background processes we start must be killed on CTRL+C #164
      • [ ] Enable Ruby tests in Windows CI #334
      • [ ] Move Dependency installation of Docker into multistage builds #319
    • [ ] Rescore existing models / evals when we ship fixes, e.g. when we improve the code-repair tool the LLM answers did not change, so we should be able to rescore a whole evaluation result right away with the new tool version.
    • [ ] Automatic tool installation with fixed version
      • [ ] Go
      • [ ] Java
    • [ ] Ensure that non-critical CLI input validation (such as unavailable models) does not panic
    • [ ] Ollama support
      • [ ] Install and test Ollama on MacOS
      • [ ] Install and test Ollama on Windows
    • [ ] Allow forwarding CLI commands to be evaluated: https://github.com/symflower/eval-dev-quality/pull/27#issuecomment-2077677950
    • [ ] Refactor Model and Provider to be in the same package https://github.com/symflower/eval-dev-quality/pull/121#discussion_r1603371915
  • [ ] Evaluation
    • [ ] Interactive result comparison #208
    • [ ] Benchmark quantized models, because they need less memory
      • https://github.com/symflower/eval-dev-quality/issues/209
        • https://twitter.com/HaihaoShen/status/1789178048308543688
    • [ ] Do an evaluation with different temperatures
    • [ ] Java
      • [ ] Let the Java test case for "no test files" actually identify and report an error that there are no test files (needs to be implemented in `symflower test`)
    • [ ] LLM
      • [ ] Improve LLM prompt
        • [ ] Take a look at https://x.com/dottxtai/status/1798443290913853770
        • [ ] Add an app name to the requests so people know it is the eval. https://openrouter.ai/docs#quick-start shows that other OpenAI-API packages implement custom headers, but the Go package we are using does not implement that, so do a PR to contribute (see the header sketch after this list).
          • [ ] We need to fork or use another package https://github.com/symflower/eval-dev-quality/issues/79#issuecomment-2082547660
        • [ ] Add markers for system and user e.g. https://github.com/symflower/eval-dev-quality/pull/27#issuecomment-2058992479
        • [ ] Think about a standardized way of printing outputs, e.g. JSON https://twitter.com/WuMinghao_nlp/status/1789094583290507626
    • [ ] Prepare language and evaluation logic for multiple files:
      • [ ] Use symflower symbols to receive files
    • [ ] Evaluation tasks
      • [ ] Evaluation task: TDD #194
      • [ ] Assess failing tests #235
      • [ ] Add evaluation task for "querying the relative test file path of a relative implementation file path" e.g. "What is the test relative file path for some/implementation/file.go" ... it is "some/implementation/file_test.go" for most cases.
      • [ ] Add evaluation task for code refactoring: two function with the same code -> extract into a helper function
      • [ ] Add evaluation task for implementing and fixing bugs using TDD
    • [ ] Check determinism of models, e.g. execute each plain repository X times and then check if the results are stable (see the determinism sketch after this list)
    • [ ] Code repair
      • [ ] 0-shot, 1-shot, ...
        • [ ] With LLM repair
        • [ ] With tool repair
    • [ ] Derive test file paths through
      • [ ] symflower symbols
      • [ ] Task for models
    • [ ] Move towards generated cases so models cannot memorize fixed cases to always reach a 100% score
    • [ ] Think about adding more training data generation features: this will also help with dynamic cases
      • [ ] Heard that Snowflake Arctic is very open about how they gathered training data... so we can see what LLM creators think of and want from training data
  • [ ] Documentation
    • [ ] Clean up and extend README
      • [ ] Better examples for contributions
      • [ ] Overhaul the explanation of "why" we need evaluation, e.g. why it is good to evaluate an empty function that does nothing.
      • [ ] Extend "how to extend the benchmark" section with instructions on how to add new tasks + languages, so we can even use LLMs to add new stuff
    • [ ] Write down a playbook for evaluations, e.g. one thing that should happen is that we let the benchmark run 5 times and then sum up the points, but ... the runs should have at least a one-hour break in between to not run into cached responses.
  • [ ] Content
    • [ ] Benchmark that showcases base-models vs their fine-tuned coding model e.g. in v0.5.0 we see that Codestral, codellama, ... are worse
    • [ ] Snowflake against Databricks would be a nice comparison since they align company-wise and are new
    • [ ] Write Tutorial for using Ollama
    • [ ] YouTube video for using Ollama
    • [ ] Blog post about the different suffixes of models, e.g. "chat" and "instruct", and eval them somehow. Idea from https://www.reddit.com/r/LocalLLaMA/comments/1bz5oyx/comment/kyrfap4/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
    • [ ] Blog post about HumanEval
    • [ ] Blog post about training a small LLM directly on HumanEval
    • [ ] Blog post about "non-determinism of LLMs" https://community.openai.com/t/a-question-on-determinism/8185 good starting point, and how we can make them at least more stable.
    • [ ] Blog post idea: misleading comments, weird coding style... how much does it take to confuse the most powerful AI? @ahumenberger
      • [ ] Maybe not only comments. What about obfuscated code, e.g. function and variable names that are just random strings?
  • [ ] Research
    • [ ] Take a look at https://twitter.com/SMT_Solvers/status/1783540994304066006
    • [ ] Take a look at all OpenRouter's features of the API e.g. https://openrouter.ai/docs#parameters
      • [ ] https://www.reddit.com/r/LocalLLaMA/comments/1cihrdt/comment/l29o97q/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button mentioned that repetition_penalty disabled helps with performance and better results for coding.
      • [ ] Requested new models for the eval https://www.reddit.com/r/LocalLLaMA/comments/1cihrdt/comment/l2d4im0/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
    • [ ] Look at what Paul Gauthier is doing with the benchmarking of Aider (and at Aider itself): https://twitter.com/paulgauthier/status/1787827703158386921?s=46 seems like a perfect match for what we want to do as tasks
    • [ ] Look at MLX, which might help us with execution https://twitter.com/loganthorneloe/status/1787845883519775120
    • [ ] Take a look at xLSTM https://twitter.com/HochreiterSepp/status/1788072466675335185
    • [ ] Take a look at eval https://twitter.com/JiaweiLiu_/status/1783959954321252697
    • [ ] Take a look at evaluation framework https://twitter.com/hamelhusain/status/1788936691576975382?s=61
    • [ ] Dig through https://arxiv.org/pdf/2405.14782 thanks to https://x.com/clefourrier/status/1793913394871062970
    • [ ] Take a look at https://x.com/dottxtai/status/1798443290913853770
  • [ ] Think about a commercial effort around the eval, so that we can balance some of the costs that go into maintaining it
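
A minimal sketch (Go, with a hypothetical `tokenTimings` type and made-up example numbers, not part of the current evaluation code) of how latency and throughput could be derived from per-request timestamps, as referenced in the latency/throughput item above:

```go
package main

import (
	"fmt"
	"time"
)

// tokenTimings are hypothetical per-request measurements: when the request was
// sent, when the first and last tokens arrived, and how many tokens were generated.
type tokenTimings struct {
	RequestSent time.Time
	FirstToken  time.Time
	LastToken   time.Time
	TokenCount  int
}

// latency returns the time-to-first-token.
func (t tokenTimings) latency() time.Duration {
	return t.FirstToken.Sub(t.RequestSent)
}

// throughput returns tokens generated per second, measured from the first to
// the last token so the initial latency does not skew the result.
func (t tokenTimings) throughput() float64 {
	d := t.LastToken.Sub(t.FirstToken).Seconds()
	if d <= 0 || t.TokenCount < 2 {
		return 0
	}

	return float64(t.TokenCount-1) / d
}

func main() {
	start := time.Now()
	timings := tokenTimings{
		RequestSent: start,
		FirstToken:  start.Add(800 * time.Millisecond),
		LastToken:   start.Add(5 * time.Second),
		TokenCount:  200,
	}
	fmt.Printf("latency: %v, throughput: %.1f tokens/s\n", timings.latency(), timings.throughput())
}
```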
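A sketch of archiving a dated snapshot of https://openrouter.ai/api/v1/models for the model-description item above; the response shape assumed here (a `data` array with `id`, `name` and `description` fields) should be verified against the live endpoint:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// modelsResponse mirrors only the fields we want to archive; the exact schema
// of the OpenRouter models endpoint is an assumption.
type modelsResponse struct {
	Data []struct {
		ID          string `json:"id"`
		Name        string `json:"name"`
		Description string `json:"description"`
	} `json:"data"`
}

func main() {
	response, err := http.Get("https://openrouter.ai/api/v1/models")
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	var models modelsResponse
	if err := json.NewDecoder(response.Body).Decode(&models); err != nil {
		panic(err)
	}

	// Store a dated snapshot next to the evaluation results so we can later
	// check which model description was current at evaluation time.
	file, err := os.Create(fmt.Sprintf("openrouter-models-%s.json", time.Now().Format("2006-01-02")))
	if err != nil {
		panic(err)
	}
	defer file.Close()

	encoder := json.NewEncoder(file)
	encoder.SetIndent("", "\t")
	if err := encoder.Encode(models); err != nil {
		panic(err)
	}
	fmt.Printf("archived %d model descriptions\n", len(models.Data))
}
```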
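A sketch for the app-name item above: injecting the optional OpenRouter attribution headers through a wrapped `http.RoundTripper`. The header names follow https://openrouter.ai/docs#quick-start; wrapping the HTTP client is only one possible workaround while the Go package we use does not expose custom headers directly:

```go
package main

import "net/http"

// attributionTransport injects the optional OpenRouter attribution headers
// into every request so API logs show that the traffic comes from the eval.
type attributionTransport struct {
	base http.RoundTripper
}

func (t attributionTransport) RoundTrip(request *http.Request) (*http.Response, error) {
	// Clone the request because RoundTrippers must not modify the original.
	request = request.Clone(request.Context())
	request.Header.Set("HTTP-Referer", "https://github.com/symflower/eval-dev-quality")
	request.Header.Set("X-Title", "eval-dev-quality")

	base := t.base
	if base == nil {
		base = http.DefaultTransport
	}

	return base.RoundTrip(request)
}

func main() {
	// Hand this client to the OpenAI/OpenRouter package so every API call
	// carries the attribution headers.
	client := &http.Client{
		Transport: attributionTransport{},
	}
	_ = client
}
```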
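A sketch for the determinism check above: hash the result file of repeated runs and report whether all runs produced identical output. The `result-run-*` directories and the `evaluation.csv` file name are assumptions for illustration:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
)

// hashFile returns the SHA-256 of a file's content so two evaluation runs can
// be compared byte-for-byte.
func hashFile(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}

	return fmt.Sprintf("%x", sha256.Sum256(data)), nil
}

func main() {
	// Hypothetical layout: one result directory per repeated run, each holding
	// an "evaluation.csv" produced by that run.
	runDirectories := []string{"result-run-1", "result-run-2", "result-run-3"}

	hashes := map[string][]string{}
	for _, directory := range runDirectories {
		hash, err := hashFile(filepath.Join(directory, "evaluation.csv"))
		if err != nil {
			panic(err)
		}
		hashes[hash] = append(hashes[hash], directory)
	}

	if len(hashes) == 1 {
		fmt.Println("model responses are stable across runs")
	} else {
		fmt.Printf("model responses differ: %d distinct results across %d runs\n", len(hashes), len(runDirectories))
	}
}
```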

zimmski · Jan 15 '25 13:01