Overall Report for an Eval Set as a Whole
Describe the feature or improvement you're requesting
An eval set is useful for running a group of evals at the same time. Currently, an eval set is just a collection of independent evals, and the oaievalset command is simply a wrapper that runs multiple oaieval commands concurrently.
I think it would be useful to analyze the data from an eval set as a whole, especially when all evals in the set share the same metric. In that case, the evals are effectively one experiment asking similar questions, split into separate evals only because the data is partitioned into different categories. For example, suppose we want to evaluate an LLM's performance on detecting spam in different languages: we want the detection accuracy for each language, as well as the overall accuracy across all spam. It would be great if an eval set could generate this kind of overall report automatically (see the sketch below).
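To illustrate what I mean, here is a minimal post-processing sketch of such an overall report. It assumes each eval run produces a JSON record with a per-eval accuracy and sample count; the field names (eval_name, final_report, accuracy, n_samples) and the record file names are hypothetical, not the actual record schema. The overall accuracy is the sample-weighted average across the evals in the set.

```python
import json
from pathlib import Path


def aggregate_eval_set(record_paths):
    """Combine per-eval accuracies into one overall accuracy,
    weighting each eval by its number of samples."""
    total_correct = 0.0
    total_samples = 0
    per_eval = {}
    for path in record_paths:
        record = json.loads(Path(path).read_text())
        # Hypothetical fields: each eval's final report is assumed
        # to expose an accuracy and the sample count it was run on.
        accuracy = record["final_report"]["accuracy"]
        n = record["final_report"]["n_samples"]
        per_eval[record["eval_name"]] = accuracy
        total_correct += accuracy * n
        total_samples += n
    overall = total_correct / total_samples if total_samples else 0.0
    return per_eval, overall


if __name__ == "__main__":
    # Hypothetical record files, one per language-specific spam eval.
    per_eval, overall = aggregate_eval_set(
        ["spam_en.json", "spam_fr.json", "spam_de.json"]
    )
    for name, acc in per_eval.items():
        print(f"{name}: {acc:.2%}")
    print(f"overall: {overall:.2%}")
```

Weighting by sample count (rather than averaging the per-eval accuracies directly) keeps the overall number consistent with what a single combined eval over all the data would report, even when the language splits have different sizes.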
Additional context
This feature request is an idea for Evals, the framework itself, not for adding new evals.