goose
data model
Invocation
- help: `cargo run --bin goose -- bench --help`
- `cargo run --bin goose -- bench` to run the "core" suite of benchmarks
- `cargo run --bin goose -- bench -s $suite_name1,$suite_name2,...,etc`
- add new benchmark-suites to `crates/goose-bench/src/eval_suites`
Semantics [DO NOT SKIP READING]
- there is a `core` suite of evaluations that runs by default if the `--suites` CLI flag is not set
  - differently stated, any evaluation not included in `core` will not run by default
- if `--suites` is supplied, only the items in that list will run, so if `core` isn't part of the list of suites passed to `--suites`, it will not run.
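
The selection rule above can be sketched as a small function. This is only an illustration of the described semantics, not the code in goose-bench: the helper name `resolve_suites`, the constant `DEFAULT_SUITE`, and the suite name `vision` used in the usage note are all hypothetical.

```rust
/// Suite that runs when no `--suites` value is given.
const DEFAULT_SUITE: &str = "core";

/// Resolve which suites to run from an optional comma-separated `--suites` value.
/// Hypothetical helper for illustration; the real CLI handling may differ.
fn resolve_suites(suites_flag: Option<&str>) -> Vec<String> {
    match suites_flag {
        // Flag absent: only the default suite runs.
        None => vec![DEFAULT_SUITE.to_string()],
        // Flag present: run exactly what was listed; `core` is not added implicitly.
        Some(list) => list
            .split(',')
            .map(str::trim)
            .filter(|name| !name.is_empty())
            .map(String::from)
            .collect(),
    }
}
```

Under these assumptions, `resolve_suites(None)` yields only `core`, while `resolve_suites(Some("vision"))` yields only `vision`, so `core` has to be listed explicitly to run alongside other suites.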
Individual Evals
- example can be examined here: `crates/goose-bench/src/eval_suites/core/complex_tasks/flappy_bird.rs`
- groups of related evals can be placed together in a module `crates/goose-bench/src/eval_suites/core/$group_name`
  - In this example `core` is the `$suite_name`
  - where each eval is in its own file at `crates/goose-bench/src/eval_suites/core/$group_name/$eval_name`
- register new evals to the top-level suite-name, not their group name.
  - ex. suite name `core` has one group `complex_tasks`, which has one eval `flappy_bird`, so it's registered as follows: `register_evaluation!("core", FlappyBird)`
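
To illustrate why registration is keyed by suite name rather than group name, here is a minimal sketch of a registry. It does not reproduce what `register_evaluation!` expands to; the `Evaluation` trait, the `Registry` type, and its methods are hypothetical stand-ins, and only the names `core`, `complex_tasks`, and `FlappyBird` come from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an evaluation; the real trait in goose-bench may differ.
trait Evaluation {
    fn name(&self) -> &str;
}

struct FlappyBird;

impl Evaluation for FlappyBird {
    fn name(&self) -> &str {
        "flappy_bird"
    }
}

/// Maps a suite name to the evals registered under it.
#[derive(Default)]
struct Registry {
    suites: HashMap<String, Vec<Box<dyn Evaluation>>>,
}

impl Registry {
    /// Register an eval under its top-level suite name (e.g. "core"),
    /// regardless of which group directory its source file lives in.
    fn register(&mut self, suite: &str, eval: Box<dyn Evaluation>) {
        self.suites.entry(suite.to_string()).or_default().push(eval);
    }

    /// Names of the evals registered under `suite`, in registration order.
    fn eval_names(&self, suite: &str) -> Vec<&str> {
        self.suites
            .get(suite)
            .map(|evals| evals.iter().map(|e| e.name()).collect())
            .unwrap_or_default()
    }
}
```

In this sketch, registering `FlappyBird` under `"core"` makes it discoverable when the `core` suite is selected; the group name `complex_tasks` only organizes source files and is never used as a lookup key.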
Limitations
- [ ] no namespacing until this PR is merged in.
  - until then, the bench does its work wherever it's run, with whatever it's allowed to do (via exts), without isolating that work to a tmp env
- [ ] copy files needed for eval into eval work-dir
- [ ] summary/run-report/errors-report
- [ ] tracing. maybe it works, maybe it doesn't, haven't checked.
- [ ] ~~does not handle configuring ollama. still necessary to manually config before running bench~~
- [ ] ~~test multiple configs easily.~~
  - ~~currently runs tests for the agent/config that's active in the environment it's run in.~~
- [ ] ~~parallelize at evals-level, or suite-level, or goose-bench~~

Struck items are outside the scope of current bench-work.