goose
work dirs
Invocation
- help: `cargo run --bin goose -- bench --help`
- `cargo run --bin goose -- bench` to run the "core" suite of benchmarks
- `cargo run --bin goose -- bench -s $suite_name1,$suite_name2,...,etc` to run only the named suites
- `cargo run --bin goose -- bench --repeat 3` to run the evals 3 times
- `cargo run --bin goose -- bench -i "some_dir,some_other_dir"` to have `some_dir` & `some_other_dir` copied into the relevant workdir that needs it.
- add new benchmark-suites to `crates/goose-bench/src/eval_suites`
How Work-Dirs...work
- the purpose of the work-dir is to have a place to read-write files, that can be referenced as the "current directory" from within the evaluation code
- each invocation of `goose bench` will create, if it does not already exist, a dir for the provider, under which there will be
  - a date-time dir for the run, under which,
    - a dir per eval-suite, under which,
      - a dir for the eval itself
- multiple runs for the same provider result in one date-time dir per run, side by side under that provider's dir.
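
As a rough illustration of the nesting described above, the work-dir for a single eval can be thought of as a path built from four parts: provider, run date-time, suite, and eval. The function name, the parameters, and the `run_stamp` value below are hypothetical and are not taken from the `goose-bench` crate; the real date-time format may differ.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: builds `<base>/<provider>/<run_stamp>/<suite>/<eval>`.
/// It only mirrors the layout described in this README; it is not the
/// crate's actual implementation.
fn eval_work_dir(
    base: &Path,
    provider: &str,
    run_stamp: &str, // placeholder for the run's date-time dir name
    suite: &str,
    eval: &str,
) -> PathBuf {
    base.join(provider)
        .join(run_stamp)
        .join(suite)
        .join(eval)
}
```

Calling it twice with the same provider but different `run_stamp` values yields two paths that share the provider dir and diverge at the date-time level, which is the shape the list above describes.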
Semantics [DO NOT SKIP READING]
- there is a `core` suite of evaluations that runs by default if the `--suites` cli flag is not set
  - differently stated, any evaluation not included in `core` will not run
- if `--suites` is supplied, only the items in that list will run, so if `core` isn't part of the list of suites passed to `--suites`, it will not run.
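
The selection rule above can be sketched as a small function. This is a minimal sketch assuming the flag value arrives as an optional comma-separated string; `suites_to_run` is a hypothetical name, not a function from the crate.

```rust
/// Hypothetical sketch of the suite-selection rule described above.
/// `suites_flag` stands in for the raw value of `--suites`, if it was passed.
fn suites_to_run(suites_flag: Option<&str>) -> Vec<String> {
    match suites_flag {
        // Flag not set: only the default `core` suite runs.
        None => vec!["core".to_string()],
        // Flag set: run exactly the listed suites. `core` is NOT added
        // implicitly, so it must be listed to run.
        Some(list) => list
            .split(',')
            .map(str::trim)
            .filter(|name| !name.is_empty())
            .map(String::from)
            .collect(),
    }
}
```

Under this sketch, passing a list that omits `core` returns a list without `core`, so to run `core` alongside other suites it has to be named explicitly.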
Individual Evals
- example can be examined here: `crates/goose-bench/src/eval_suites/core/example.rs`
- groups of related evals can be placed together in a rust module representing the suite `crates/goose-bench/src/eval_suites/core`
  - In this example `core` is the `$suite_name`
  - where each eval is in its own file at `crates/goose-bench/src/eval_suites/core/$eval_name`
- register new evals to the suite-name.
  - ex. suite name `core`, which has one eval `example`, so it's registered as follows: `register_evaluation!("core", ExampleEval)`
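
To make the idea of registering evals under a suite name concrete, here is a toy registry. It is only an illustration of the grouping: `SuiteRegistry` and its methods are hypothetical, and this is not how `register_evaluation!` is implemented in the crate.

```rust
use std::collections::BTreeMap;

/// Toy, hypothetical registry mapping a suite name to the evals registered
/// under it. It is not the mechanism behind `register_evaluation!`.
#[derive(Default)]
struct SuiteRegistry {
    suites: BTreeMap<String, Vec<String>>,
}

impl SuiteRegistry {
    /// Adds `eval` to `suite`, creating the suite entry on first use.
    fn register(&mut self, suite: &str, eval: &str) {
        self.suites
            .entry(suite.to_string())
            .or_default()
            .push(eval.to_string());
    }

    /// Returns the evals registered under `suite` (empty if the suite is unknown).
    fn evals_in(&self, suite: &str) -> &[String] {
        self.suites.get(suite).map(Vec::as_slice).unwrap_or(&[])
    }
}
```

In this sketch, registering `"example"` under `"core"` makes it the only entry returned for `"core"`, while an unregistered suite name returns an empty slice.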
Limitations
- [x] no namespacing until this PR is merged in.
  - until then, wherever it's run, and whatever it's allowed to do (via exts), it will, without isolating its work to a tmp env
- [x] copy files needed for eval into eval work-dir
- [ ] bug: building with `--release` affects which eval suites are run. To Be Debugged
- [ ] summary/run-report/errors-report
- [ ] tracing. maybe it works, maybe it doesn't, haven't checked.
- [ ] ~~does not handle configuring ollama. still necessary to manually config before running bench~~
- [ ] ~~test multiple configs easily.~~
  - ~~currently runs tests for the agent/config that's active in the environment it's run in.~~
- [ ] ~~parallelize at evals-level, or suite-level, or goose-bench~~

struck items are outside the scope of current bench-work.