Benchmarks/evals
I did some smaller benchmarks (more like tests, really) and would like to continue this effort to evaluate capabilities and weak spots.
It would also be interesting to compare against gpt-engineer on codegen tasks (see #62), for example on the gpt-engineer eval suite and SWE-bench.
- [x] Set up basic evals
- [x] Write docs
- [x] Dockerize
- [ ] Write more difficult eval set (a rough sketch of a case follows after this list)
  - [x] project init (git, rust, react)
  - [ ] SWE-Bench: https://github.com/ErikBjare/gptme/pull/142
  - [ ] npm run dev + browser + screenshot + edit request: #52
- [ ] Write up a blog post or similar
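To make the "more difficult eval set" item a bit more concrete, here is a rough sketch of what a project-init case could look like. The `EvalCase` structure and its fields (`prompt`, `run`, `checks`) are illustrative assumptions, not the harness's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalCase:
    """One eval case: prompt the agent, run a verification command in the
    resulting workspace, then apply checks to that command's output."""
    name: str
    prompt: str
    run: str  # shell command used to verify the agent's work
    checks: dict[str, Callable[[str], bool]] = field(default_factory=dict)


# Hypothetical "project init (rust)" case: the agent should scaffold a cargo
# project and leave it in a state where `cargo build` succeeds.
init_rust = EvalCase(
    name="init-rust",
    prompt="Create a new Rust project with cargo and make sure `cargo build` passes.",
    run="cargo build 2>&1 && ls",
    checks={
        "has_cargo_toml": lambda out: "Cargo.toml" in out,
        "builds_cleanly": lambda out: "error[" not in out,
    },
)
```

The git and react variants would follow the same shape; only the prompt, verification command, and checks change.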
Improved the eval harness quite a bit in #90, among other changes (incl a lot of Docker stuff).
I'm now 80% happy with the harness and am trying to think about how it would provide value for the project/community.
That includes which types of things to eval (shell scripting, complicated patches, Python REPL usage), and which external evals we should try running gptme on (probably a great learning opportunity to get experience with other eval frameworks).
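To illustrate the categories I have in mind, here is a rough sketch of how cases could be grouped by the capability they probe. The names, prompts, and dict layout are all hypothetical, just to show the kind of task each category would cover:

```python
# Hypothetical grouping of eval cases by capability; not the harness's real layout.
EVAL_SUITES: dict[str, list[dict]] = {
    "shell": [
        {
            "name": "count-loc",
            "prompt": "Count the lines of Python code in this repo and write the total to loc.txt.",
            "run": "cat loc.txt",
            "check": lambda out: out.strip().isdigit(),
        },
    ],
    "patch": [
        {
            "name": "rename-function",
            "prompt": "Rename `load_config` to `read_config` across the codebase without breaking imports.",
            "run": "grep -r load_config . || true",
            "check": lambda out: out.strip() == "",  # no occurrences left behind
        },
    ],
    "python-repl": [
        {
            "name": "fibonacci",
            "prompt": "Use the Python REPL to compute the 30th Fibonacci number and save it to fib.txt.",
            "run": "cat fib.txt",
            "check": lambda out: out.strip() == "832040",
        },
    ],
}
```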