eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Part of #128
The v0.5.0 release is mainly meant to introduce more variety. There are three main goals: 1. Introduce more logical cases, to make sure that "better models" show a bigger difference in...
### Tasks

- [x] Introduce 2 assessment keys:
  - `AssessmentKeyResponseCharacterCount`
  - `AssessmentKeyGenerateTestsForFileCharacterCount`
- [x] LLM model
  - File: `model/llm/llm.go`
  - Function: `GenerateTestsForFile`
- [x] When parsing the model response, count...
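The character counting behind the two assessment keys could be sketched as follows. This is a minimal sketch, not the actual implementation in `model/llm/llm.go`; the helper name `characterCounts` is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// characterCounts is a hypothetical helper sketching how the two assessment
// keys above could be computed: the length of the full model response and the
// length of the test code extracted from it. Runes are counted instead of
// bytes so multi-byte characters count once.
func characterCounts(response string, extractedTestCode string) (responseCount int, testCodeCount int) {
	return utf8.RuneCountInString(response), utf8.RuneCountInString(extractedTestCode)
}

func main() {
	code := "func add(a, b int) int { return a + b }"
	response := "Sure! Here is the test:\n" + code
	responseCount, testCodeCount := characterCounts(response, code)
	fmt.Println(responseCount, testCodeCount)
}
```

Counting both values separately makes it possible to compare how much of a model's response is actual test code versus surrounding chatter.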
@zimmski The regex to check for the temporary test directory does not work on Windows right now, but I didn't want to postpone the PR further because of it. Would...
https://github.com/symflower/eval-dev-quality/actions/runs/9139604580/job/25132073770#step:9:841
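One way to make such a check OS-agnostic is to normalize path separators before matching, so a single regex covers both Unix and Windows paths. A sketch, assuming the temporary directories are named `eval-dev-quality` plus a numeric suffix (the real prefix and regex may differ):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Assumption: temporary test directories are named "eval-dev-quality" plus a
// numeric suffix. The actual naming in the project may differ.
var temporaryTestDirectoryRe = regexp.MustCompile(`/eval-dev-quality[0-9]+/`)

// isInTemporaryTestDirectory sketches an OS-agnostic check: Windows paths use
// "\" as the separator, so every separator is normalized to "/" before the
// regex is applied, instead of matching the raw path.
func isInTemporaryTestDirectory(path string) bool {
	normalized := strings.ReplaceAll(path, `\`, "/")
	return temporaryTestDirectoryRe.MatchString(normalized)
}

func main() {
	fmt.Println(isInTemporaryTestDirectory(`C:\Temp\eval-dev-quality123\plain_test.go`))
	fmt.Println(isInTemporaryTestDirectory("/tmp/eval-dev-quality456/plain_test.go"))
	fmt.Println(isInTemporaryTestDirectory("/home/user/project/plain_test.go"))
}
```

Normalizing the input rather than maintaining per-OS regexes keeps the pattern itself portable.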
We need a common helper to sandbox all the executions we are doing. Right now, an LLM could generate a remove-all-your-files call, and we just execute it.
We want at least:

- goimports (maybe even https://github.com/mvdan/gofumpt) for formatting
- https://github.com/dominikh/go-tools
- https://github.com/mgechev/revive