
data model

Open laanak08 opened this issue 10 months ago • 0 comments

Invocation

  • help: cargo run --bin goose -- bench --help
  • run the "core" suite of benchmarks: cargo run --bin goose -- bench
  • run specific suites: cargo run --bin goose -- bench -s $suite_name1,$suite_name2,...,etc
  • add new benchmark-suites to crates/goose-bench/src/eval_suites

Semantics [DO NOT SKIP READING]

  • there is a core suite of evaluations that runs by default if the --suites CLI flag is not set
    • put differently, by default any evaluation not included in core will not run
  • if --suites is supplied, only the suites in that list will run, so if core isn't in the list passed to --suites, it will not run.
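
The selection rule above can be sketched as a small function. This is only an illustration of the semantics described here, not the actual goose-bench code: `select_suites` and `DEFAULT_SUITE` are hypothetical names, and the comma-splitting is assumed from the invocation syntax shown earlier.

```rust
/// Suite that runs when no `--suites` value is given (per the semantics above).
const DEFAULT_SUITE: &str = "core";

/// Resolve which suites to run from the raw `--suites` value.
/// Hypothetical helper: the real CLI parsing in goose-bench may differ.
fn select_suites(suites_flag: Option<&str>) -> Vec<String> {
    match suites_flag {
        // Flag not set: only the default suite runs.
        None => vec![DEFAULT_SUITE.to_string()],
        // Flag set: run exactly the listed suites and nothing else.
        // "core" is not added implicitly.
        Some(list) => list
            .split(',')
            .map(str::trim)
            .filter(|name| !name.is_empty())
            .map(String::from)
            .collect(),
    }
}
```

Under this sketch, `select_suites(None)` yields only `core`, while passing a list that omits `core` yields a selection without it, which is the behavior the section warns about.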

Individual Evals

  • example can be examined here: crates/goose-bench/src/eval_suites/core/complex_tasks/flappy_bird.rs
  • groups of related evals can be placed together in a module crates/goose-bench/src/eval_suites/core/$group_name
    • In this example core is the $suite_name
    • where each eval is in its own file at crates/goose-bench/src/eval_suites/core/$group_name/$eval_name
    • register new evals under the top-level suite name, not their group name.
      • ex. suite name core has one group complex_tasks, which has one eval flappy_bird, so it's registered as follows:
      • register_evaluation!("core", FlappyBird)
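
To make the "suite name, not group name" rule concrete, here is a minimal sketch of a registry keyed by suite. Everything in it is an assumption for illustration: the `Evaluation` trait, the `Registry` type, and its methods are hypothetical stand-ins, and the real `register_evaluation!` macro may work quite differently.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an evaluation; the real trait is not shown in this issue.
trait Evaluation {
    fn name(&self) -> &str;
}

/// Lives at core/complex_tasks/flappy_bird.rs in the layout described above.
struct FlappyBird;

impl Evaluation for FlappyBird {
    fn name(&self) -> &str {
        "flappy_bird"
    }
}

/// Hypothetical registry: evals are looked up by suite name only.
#[derive(Default)]
struct Registry {
    by_suite: HashMap<String, Vec<Box<dyn Evaluation>>>,
}

impl Registry {
    /// Register an eval under its top-level suite (e.g. "core").
    fn register(&mut self, suite: &str, eval: Box<dyn Evaluation>) {
        self.by_suite.entry(suite.to_string()).or_default().push(eval);
    }

    /// Names of all evals registered under `suite`; empty if the suite is unknown.
    fn evals_in(&self, suite: &str) -> Vec<&str> {
        self.by_suite
            .get(suite)
            .map(|evals| evals.iter().map(|e| e.name()).collect::<Vec<_>>())
            .unwrap_or_default()
    }
}
```

In this sketch, registering `FlappyBird` under `"core"` makes it visible when the `core` suite is selected, whereas a lookup by the group name `complex_tasks` finds nothing, because the group is only a directory-level grouping.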

Limitations

  • [ ] no namespacing until this PR is merged in.
    • until then, an eval acts wherever it's run and does whatever it's allowed to do (via exts), without isolating its work to a tmp env
    • [ ] copy files needed for eval into eval work-dir
  • [ ] summary/run-report/errors-report
  • [ ] tracing. maybe it works, maybe it doesn't, haven't checked.
  • [ ] ~~does not handle configuring ollama. still necessary to manually config before running bench~~
  • [ ] ~~test multiple configs easily.~~
    • ~~currently runs tests for the agent/config that's active in the environment it's run in.~~
  • [ ] ~~parallelize at evals-level, or suite-level, or goose-bench~~

struck-through items are outside the scope of the current bench work.

laanak08 · Feb 18 '25 17:02