Benchmarks/evals
I did some smaller benchmarks (more like tests, really) and would like to continue this effort to evaluate capabilities and weak spots.
It would also be interesting to compare against gpt-engineer on codegen tasks (see #62), for example on the gpt-engineer eval suite and SWE-bench.
- [x] Set up basic evals
- [x] Write docs
- [x] Dockerize
- [ ] Write more difficult eval set (a rough sketch of a case follows after this list)
  - [x] project init (git, rust, react)
  - [ ] SWE-Bench: https://github.com/ErikBjare/gptme/pull/142
  - [ ] npm run dev + browser + screenshot + edit request: #52
- [ ] Write up a blog post or similar
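To make the "more difficult eval set" item a bit more concrete, here is a rough sketch of what a project-init case could look like. The `EvalCase` structure and its fields (`prompt`, `run`, `checks`) are illustrative assumptions, not the harness's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    """One eval case: prompt the agent, run a verification command in the
    resulting workspace, then apply checks to that command's output."""
    name: str
    prompt: str
    run: str  # shell command used to verify the agent's work
    checks: dict[str, Callable[[str], bool]] = field(default_factory=dict)


# Hypothetical "project init (rust)" case: the agent should scaffold a cargo
# project and leave it in a state where `cargo build` succeeds.
init_rust = EvalCase(
    name="init-rust",
    prompt="Create a new Rust project with cargo and make sure `cargo build` passes.",
    run="cargo build 2>&1 && ls",
    checks={
        "has_cargo_toml": lambda out: "Cargo.toml" in out,
        "builds_cleanly": lambda out: "error[" not in out,
    },
)
```

The git and react variants would follow the same shape; only the prompt, verification command, and checks change.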
Improved the eval harness quite a bit in #90, among other changes (incl a lot of Docker stuff).
I'm now 80% happy with the harness and am trying to think about how it would provide value for the project/community.
That includes which types of things to eval (shell scripting, complicated patches, Python REPL usage), and which external evals we should try running gptme on (probably a great learning opportunity to get experience with other eval frameworks).
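To illustrate the categories I have in mind, here is a rough sketch of how cases could be grouped by the capability they probe. The names, prompts, and dict layout are all hypothetical, just to show the kind of task each category would cover:

```python
# Hypothetical grouping of eval cases by capability; not the harness's real layout.
EVAL_SUITES: dict[str, list[dict]] = {
    "shell": [
        {
            "name": "count-loc",
            "prompt": "Count the lines of Python code in this repo and write the total to loc.txt.",
            "run": "cat loc.txt",
            "check": lambda out: out.strip().isdigit(),
        },
    ],
    "patch": [
        {
            "name": "rename-function",
            "prompt": "Rename `load_config` to `read_config` across the codebase without breaking imports.",
            "run": "grep -r load_config . || true",
            "check": lambda out: out.strip() == "",  # no occurrences left behind
        },
    ],
    "python-repl": [
        {
            "name": "fibonacci",
            "prompt": "Use the Python REPL to compute the 30th Fibonacci number and save it to fib.txt.",
            "run": "cat fib.txt",
            "check": lambda out: out.strip() == "832040",
        },
    ],
}
```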