Document how we organize datasets for automatic evals

Open ravenac95 opened this issue 8 months ago • 1 comment

We should have a document describing how datasets are discovered for automatic eval execution.

Current plan:

Each eval should have a frequency value, something like cron or on-deployments.
If the frequency is set to cron, then a cron value should also be set.
Each dataset should have tags in the style eval:NAME_OF_EVAL, where the value is a boolean indicating whether that specific eval is enabled for the dataset.
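
To make the plan above concrete, here is a minimal Python sketch of how discovery could work. Nothing here is an existing API: the `EvalConfig` and `Dataset` classes, their field names, and the `datasets_for_eval` helper are all hypothetical, and the only details taken from the plan are the two frequency values, the cron requirement, and the `eval:NAME_OF_EVAL` boolean tag.

```python
from dataclasses import dataclass, field
from typing import Optional

# Frequency values named in the plan; the final set may differ.
ALLOWED_FREQUENCIES = ("cron", "on-deployments")


@dataclass
class EvalConfig:
    """Hypothetical eval definition, not an existing class."""
    name: str
    frequency: str
    cron: Optional[str] = None  # cron value, needed when frequency == "cron"

    def __post_init__(self) -> None:
        if self.frequency not in ALLOWED_FREQUENCIES:
            raise ValueError(f"unknown frequency: {self.frequency!r}")
        if self.frequency == "cron" and not self.cron:
            raise ValueError("frequency 'cron' requires a cron value")


@dataclass
class Dataset:
    """Hypothetical dataset record carrying eval:NAME_OF_EVAL tags."""
    name: str
    tags: dict = field(default_factory=dict)  # tag name -> bool


def datasets_for_eval(eval_name: str, datasets: list) -> list:
    """Return the datasets whose eval:<eval_name> tag is exactly True."""
    tag = f"eval:{eval_name}"
    return [d for d in datasets if d.tags.get(tag) is True]
```

In this sketch a missing tag is treated the same as `False`, so an eval only runs on datasets that opt in explicitly. That default is an assumption and would need to be settled in the document.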

May 08 '25 16:05 ravenac95

May 08 '25 16:05 linear[bot]

At the very least, we want to be able to specify metadata filters, for example:

!run_eval text2sql where the priority is high or something like that
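
As a rough illustration of what such a metadata filter could mean, here is a small Python sketch. It assumes datasets are plain dicts with hypothetical `tags` and `metadata` keys, and that a filter is a set of exact key/value matches. The `!run_eval` command syntax itself is not defined in this thread, so the sketch does not try to parse it.

```python
from typing import Any


def matches(metadata: dict[str, Any], filters: dict[str, Any]) -> bool:
    """True when every filter key is present in metadata with an equal value."""
    return all(k in metadata and metadata[k] == v for k, v in filters.items())


def select_datasets(
    datasets: list[dict[str, Any]],
    eval_name: str,
    filters: dict[str, Any],
) -> list[dict[str, Any]]:
    """Keep datasets tagged for the eval that also satisfy the filters.

    The "tags" and "metadata" keys are assumptions made for this sketch.
    """
    tag = f"eval:{eval_name}"
    return [
        d
        for d in datasets
        if d.get("tags", {}).get(tag) is True
        and matches(d.get("metadata", {}), filters)
    ]


# Roughly what "run text2sql where the priority is high" might select.
example = [
    {"name": "orders", "tags": {"eval:text2sql": True}, "metadata": {"priority": "high"}},
    {"name": "users", "tags": {"eval:text2sql": True}, "metadata": {"priority": "low"}},
]
selected = select_datasets(example, "text2sql", {"priority": "high"})
```

Exact-match filters are the simplest option. Richer operators (ranges, negation, "or") would need a real filter syntax, which is still open here.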

May 28 '25 20:05 ryscheng