Evaluation harness

mathemakitten opened this pull request on Aug 01 '22 • 4 comments

Here's a rough proposal for an evaluation harness interface: users pass in a JSON file which configures the evaluator and a set of "tasks", each made up of a dataset, a metric, and other kwargs to be passed to the evaluator. Currently it reads in local JSON files as well as those in any Space on the Hub (e.g. like this one).

It can be tested with an example of a sentiment analysis harness I put together with the IMDB and SST2 datasets:

    >>> from evaluate.harness import Harness

    >>> harness_config = Harness.from_json('mathemakitten/sentiment')
    >>> harness = Harness(harness_config)
    >>> results = harness.run()
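
For reference, a config for a sentiment harness like the one above might look roughly like the sketch below. The field names ("task_type", "tasks", "data", "metric") and the filename are illustrative assumptions, not the schema actually implemented in this PR.

    import json

    # Hypothetical config: one task type plus a list of tasks, each defined by
    # a dataset, a metric, and any extra kwargs for the evaluator.
    harness_config = {
        "task_type": "text-classification",
        "tasks": [
            {"data": "imdb", "metric": "accuracy"},
            {"data": "sst2", "metric": "accuracy"},
        ],
    }

    with open("sentiment_harness.json", "w") as f:
        json.dump(harness_config, f, indent=2)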

I'd love feedback on the items I've noted in the code below as well as on the overall design, cc @lvwerra + @lhoestq

mathemakitten · Aug 01 '22 23:08

A Harness is an interesting addition! Thanks for working on this! A few thoughts:

API

I think we could skip the step of returning a config:

    harness = Harness.from_json('mathemakitten/sentiment')

Also we would need to pass a model, right?

    harness.run(model="gpt2")  # or maybe .evaluate()?

Dataset splits

There is #226, which we need to fix, and we probably need to add a dataset_split kwarg to the evaluator so that we don't have to pass the loaded datasets to each evaluator.
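
As a loose illustration of that idea (the dataset_split kwarg below is hypothetical, named after the suggestion above, and the model is just an example checkpoint):

    from datasets import load_dataset
    from evaluate import evaluator

    task_evaluator = evaluator("text-classification")

    # Today: the caller loads the split and passes the dataset in.
    data = load_dataset("imdb", split="test")
    results = task_evaluator.compute(
        model_or_pipeline="lvwerra/distilbert-imdb",
        data=data,
        label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
    )

    # Proposed: the evaluator loads the split itself.
    # results = task_evaluator.compute(
    #     model_or_pipeline="lvwerra/distilbert-imdb",
    #     data="imdb",
    #     dataset_split="test",
    #     label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
    # )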

Post-processing

I am not so familiar with how a harness should work - does it require computing an overall score? Do we need to provide a function that takes the list of all result dicts and computes an aggregated score?

lm-evaluation-harness

When we set up the Harness, should we make sure it is compatible with the lm-evaluation-harness so we could easily integrate it?

Right time?

I think the main application for a harness is general language models rather than models such as classifiers, since those usually don't transfer well to other tasks (sentiment is maybe the one exception for classification). So we would first need an evaluator that takes a language model and a supervised task and computes the scores. Should we add this first? I think the needs will become much clearer once we have that, and if we can solve a real use-case while developing it, we maximise our chances of nailing the API.

What do you think?

lvwerra · Aug 04 '22 16:08

Thanks @lvwerra 🌊

  • Skipping the config sounds fine to me. I had modelled this after Transformers' AutoConfig.from_pretrained(*args, **kwargs) logic, but I'm also happy to init the Harness object directly, will change.
  • Right now we allow a user to pass in the model via config.model or via harness.run(model_or_pipeline=...), but I'm OK with removing the first option and only allowing it to be passed into harness.run!
  • I don't have strong feelings about harness.run vs harness.evaluate, but the word evaluate is already used a lot here (lol) in various contexts, and I often hear people refer to it as "run the evaluation harness", hence run being a sensible name. Not a strong opinion though, so if you feel differently then .evaluate would be fine too.
  • re: post-processing, no: the metric ranges can be unscaled and are thus not good candidates for meaningful averaging/aggregation. The results are often reported per-task in tabular format, so the current output of Evaluator actually works quite well: we return a dictionary containing each result output from the Evaluator under its own key, which is easily transformed into a dataframe or whatever downstream format the user wants to work in (see the sketch after this list).
  • Yes, I'm going to try out writing up a JSON which will run the lm-evaluation-harness after we finalize design!
  • For "right time" + "dataset splits" comments: good points, I can take a look at those shortly before continuing design on this
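
To make the per-task reporting mentioned above concrete, here is a minimal sketch; the result structure below is an assumption about what harness.run() might return, not its actual output:

    import pandas as pd

    # Hypothetical harness output: one Evaluator result dict per task.
    results = {
        "imdb": {"accuracy": 0.91, "total_time_in_seconds": 42.0},
        "sst2": {"accuracy": 0.88, "total_time_in_seconds": 17.0},
    }

    # One row per task; no cross-task averaging, since metric ranges may differ.
    df = pd.DataFrame.from_dict(results, orient="index")
    print(df)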

mathemakitten · Aug 04 '22 19:08

Skipping the config sounds fine to me. I had modelled this after Transformers' AutoConfig.from_pretrained(*args, **kwargs) logic, but I'm also happy to init the Harness object directly, will change.

I guess it's slightly different there, because the config is then used with AutoModel. Here I think we don't need that extra step.

Right now we allow a user to pass in the model via config.model or via harness.run(model_or_pipeline=...), but I'm OK with removing the first option and only allowing it to be passed into harness.run!

I feel like the model/pipeline should have a special place. The harness object is defined by a list of tasks via the config and then runs a model through those tasks, so I am in favour of passing it specifically in run.

I don't have strong feelings about harness.run vs harness.evaluate, but the word evaluate is already used a lot here (lol)

Fair point, I also don't have strong feelings, so let's keep run (or compute, which would be similar to metrics and evaluators).

Yes, I'm going to try out writing up a JSON which will run the lm-evaluation-harness after we finalize design!

I think that would be a great goal for this PR: add the Harness and the lm-evaluation-harness (or at least a minimal version of it). We could host the canonical ones on an evaluate-harness org (similar to https://huggingface.co/evaluate-metric). If we add a README they would come with a nice documentation page too.

PS: I have a slight preference for calling it from_json rather than from_config, which can point to a repo. It also keeps the door open should we want to support other formats.

lvwerra · Aug 05 '22 09:08

Merging with main resulted in a ton of extra commits from the merge commits, which has made the history on this branch really difficult to read/rebase, so I'll likely redo this whole PR on a new branch.

We talked about uploading harness config files as Datasets rather than a Space, which isn't yet accounted for in this PR, but all that has to change is the harness loading logic (both datasets and spaces support JSON files, IIRC).
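
For the loading side, something along these lines would probably suffice; the function name and default filename are made up for illustration:

    import json
    from huggingface_hub import hf_hub_download

    def load_harness_config(repo_id, filename="harness_config.json", repo_type="dataset"):
        # repo_type="space" works the same way, since both repo types can host JSON files.
        path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type=repo_type)
        with open(path) as f:
            return json.load(f)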

I'll also provide an example of how this works, but with GLUE rather than the (much bigger and more complicated) Eleuther harness. A lot of the zero-shot tasks are currently unsupported because they require data preprocessing to get the examples into a format suitable for the task, which we don't yet properly handle.

mathemakitten · Sep 23 '22 22:09