                        Evaluation harness
Requires #301 to be merged first (lots of these diffs on data loading in the evaluator code will disappear afterward).
You can run the GLUE harness like this, for example:
from evaluate.harness import Harness
harness = Harness('mathemakitten/glue-harness')
results = harness.run(model_or_pipeline='gpt2')
where the JSON file contains metadata on the dataset and args to pass to the evaluator. The example config for GLUE is here.
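Roughly, each entry bundles dataset metadata with kwargs that get forwarded to the evaluator. As a rough sketch (the field names below are illustrative, not the exact format of the GLUE config):
# Hypothetical shape of one entry in the harness config: dataset metadata
# plus args_for_task kwargs that get forwarded to the evaluator.
config = {
    "dataset": "glue",
    "subset": "sst2",
    "split": "validation",
    "args_for_task": {
        "task": "text-classification",
        "metric": "accuracy",
        "input_column": "sentence",
        "label_column": "label",
        "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0},
    },
}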
After this core logic is nailed down I'll move toward incorporating a larger harness (e.g. the LM harness by EleutherAI, which requires support for multiple evaluators for one model). Also, qqp is actually a semantic similarity task and should be evaluated as such; we don't have the right evaluator implemented for it yet, though.
Thanks @lvwerra, have added all those suggestions! If the API looks good to you now I'll add some tests assuming this format :)
@lhoestq maybe we can use the train_eval_index by default if nothing is specified and have the option to overwrite it (if a train_eval_index is available) or provide one (if no train_eval_index is available). I think in some cases people might want a custom setup?
I see ! Users could overwrite it with kwargs in load_harness. And to have a full custom setup you can simply copy the dataset repository and modify the train_eval_index directly ?
Thanks for the pointer re: train_eval_index! Yes, we can definitely use the existing train_eval_index by default if nothing is specified and overwrite with the JSON config if needed. Will incorporate.
Cool ! Overall I'd be in favor of not introducing new files or metadata if we can use something that exists already - this is less confusing for users
Could there be a way to pass config_name down to a metric? Right now I think we only have access to metric as a string in the args_for_task, which means that AFAIK we can't pass a config_name or any other arguments for the metric.
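For example, something along these lines is what I mean; the harness doesn't support it yet, but evaluate.load itself already accepts a config name:
import evaluate

# e.g. the GLUE metric needs to know which subset it is scoring; today the
# harness config only carries the metric name as a plain string.
metric = evaluate.load("glue", "mrpc")  # second argument is the config name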
OK, I had a look at how train_eval_index is formulated and I'm having a bit of trouble reconciling how to get all the features we need out of it. After looking at a few datasets with train_eval_index I am still leaning toward having a separate EvaluationSuite config file, but maybe if the fields aren't specified, then we can pull them from train_eval_index? Either way I don't think we can avoid requiring new metadata in a config somewhere, but let me know what you think? Sorry in advance for the long read! 🤗
- The upside of the evaluation suite should be that it allows you to compose arbitrary existing datasets from the Hub together for evaluation.
- If I wanted to evaluate on just a subset of GLUE, is there a way that I can extract just those tasks without writing another config, custom code, or cloning the dataset repo? For example, if I wanted to evaluate on x tasks from GLUE as well as y tasks from SuperGLUE, I want to reuse the existing datasets for those instead of creating a whole new dataset, copying over the data, and specifying a custom train_eval_index. It seems like we would have a lot of data duplication if everyone created new custom datasets (one copy for each combination of tasks we want to evaluate on). The upside of having an evaluation suite come with its own config is that we can avoid copying the dataset repo and just reuse existing datasets on the Hub by composing them and using the Evaluator's data loading logic to load the subsets we need. (See the sketch after this list for the kind of composition I mean.)
- Since load_evaluation_suite should work with multiple different types of evaluators, and it would get messy quickly to pass a big nested data structure full of kwargs for each evaluator around in the code, I think we should avoid passing custom kwargs to the load_evaluation_suite function. I tried to mirror the way that metric modules are loaded right now, where ideally someone would be able to simply upload a config file for the evaluation suite and the code would load it by name without extra kwargs.
- If we want to allow passing down config_name to metrics we would need to specify that somewhere as well, either in the train_eval_index or in a specific evaluation suite config file. Right now the attributes outlined in train_eval_index seem to all be attributes of the dataset itself. However, the evaluator also requires other kwargs which are not present in train_eval_index, e.g. "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0}; this is specific to a dataset-evaluator pair, and not an attribute of the dataset alone, so it wouldn't make sense to add it to the train_eval_index, right? If we don't want to modify the fields allowed in train_eval_index and they should remain dataset-specific, we need to pass these specific kwargs around somewhere (and see above for why I don't think they should be passed in load_evaluation_suite).
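To make the composition point concrete, a suite config could just point at existing Hub datasets and carry per-task evaluator kwargs. This is only a rough sketch; every key below is a placeholder rather than a proposed schema:
suite_config = [
    # Reuse existing Hub datasets by reference instead of copying their data;
    # args_for_task holds the dataset-evaluator-specific kwargs like label_mapping.
    {"dataset": "glue", "subset": "sst2", "split": "validation",
     "args_for_task": {"metric": "accuracy",
                       "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0}}},
    {"dataset": "super_glue", "subset": "boolq", "split": "validation",
     "args_for_task": {"metric": "accuracy",
                       "label_mapping": {"LABEL_0": 0.0, "LABEL_1": 1.0}}},
]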
Oh ok I see ! Let's use new config files then ^^' you can use a space to store it and have a custom UI to present it. In a dataset repo you can only do markdown, and it would be weird to show a preview of a config file
Note that enriching train_eval_index a bit is maybe possible, cc @lewtun
okok will do! That makes sense re: markdown and config, thanks @lhoestq :) In that case I'll keep the EvaluationSuite config. I appreciate you bringing up the train_eval_index though because I didn't know that existed and I also generally agree with trying to maximize code reuse whenever possible. We should definitely consider extending train_eval_index to be usable with evaluate though, it just might be a bit of a bigger undertaking.
Following up on last week's discussion, here are some ideas for how we could frame the SubTasks (placeholder name, I am sure there is a better one) of the EvaluationSuite as classes/dataclasses. In discussions it became clear that some datasets will need some preprocessing before passing them to the evaluator, and the question is where to put that logic. Here I propose to define it in a Python script along with the SubTask definition.
In evaluate:
from dataclasses import dataclass
from typing import Callable, Optional, Union

from datasets import Dataset

@dataclass
class SubTask:
    # "Model" and "Array" are placeholder types from the proposal, kept as forward references
    model_or_pipeline: Optional[Union[str, "Model"]] = None
    data: Optional[Union[str, Dataset, "Array"]] = None
    subset: Optional[str] = None
    split: Optional[str] = None
    data_preprocessor: Optional[Callable] = None
On the Hub or locally (lvwerra/sentiment_suite.py):
# assuming SubTask can be imported from evaluate once the proposal above lands
from evaluate import SubTask

preprocessor = lambda x: x["text"].lower()
suite = [
    SubTask(
        data="imdb",
        split="test",
        data_preprocessor=preprocessor
    ),
    SubTask(
        data="sst2",
        split="validation",
        data_preprocessor=preprocessor
    )
]
User (notebook_17.ipynb):
from evaluate import Suite
suite = Suite.load("lvwerra/sentiment_suite")  # from the Hub
suite.run("lvwerra/distilbert-imdb")
We could also support loading from JSON, but in that case I think we should disable the data_preprocessor argument and instead point to the class documentation. I think importing a Python function/module via JSON could get complicated. This is just a draft/skeleton - let me know what you think!
Thanks so much for this @lvwerra! A preprocessor for the data that you can pass in is a great idea. My main concern about the given approach is that it looks like you would have to re-preprocess the same data for each task that uses the same dataset, rather than preprocessing the data once and then passing it in for each. Is that right? If so, that's a lot of duplicated work for a time-consuming computation.
My thought was that it could work something like this. Assuming you have 2 tasks you want to run. The first one might be called for example "multi_classification", for classifying content among multiple labels. The second one, "binary_classification", for a basic model.
preprocessor = lambda x: x["text"].lower()

suite = [
    TaskDataset(
        Task("multi_classification"),
        Task("binary_classification"),
        data="imdb",
        split="test",
        data_preprocessor=preprocessor,
    ),
    TaskDataset(
        Task("multi_classification"),
        Task("binary_classification"),
        data="sst2",
        split="test",
        data_preprocessor=preprocessor,
    ),
]
Does that make sense? In this way, data would only need to be processed once for all the ways it can be used.
Regarding this bit:
> We could also support loading from JSON, but in that case I think we should disable the data_preprocessor argument and instead point to the class documentation. I think importing a Python function/module via JSON could get complicated. This is just a draft/skeleton - let me know what you think!
I'm supportive of not using JSON, especially if not using it allows us to use a data_preprocessor. =)
Cool ! The "EvaluationSuite" sounds clear to me and also how to define it, nice ! ^^
Other ideas of name for "SubTask" (since "task" is already used in the HF ecosystem for other things): maybe something like "EvaluationData" or "DataSource"
Could you also give more details on why there's "model_or_pipeline" in what you call "SubTask" ?
- Duplicate work: If we stick to datasets and use ds.map(data_preprocessor) then the processing would be cached by default. So if two tasks use the same dataset + configuration (via split and subset) + data_preprocessor, the result would be loaded from cache rather than recomputed (even on TB scale this only takes a few minutes). Does that address your concern @meg-huggingface? A minimal example follows after this list.
- model_or_pipeline: Also not entirely sure, I copied it from @mathemakitten's implementation :) The classical evaluation benchmark would take this as input, I guess, and assume dataset and metric as given. Maybe the idea is to build a data evaluation benchmark with it too, where the input is a dataset and the suite defines models and metrics to evaluate it.
- Naming: I agree, we should change it. Throwing another option into the hat: EvaluationJob
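To make the caching point concrete, here is a minimal example with the datasets API (note that ds.map expects the function to return a dict, so the preprocessor is written accordingly):
from datasets import load_dataset

preprocessor = lambda x: {"text": x["text"].lower()}  # map expects a dict of columns

ds = load_dataset("imdb", split="test")
first = ds.map(preprocessor)   # computed once and written to the datasets cache
second = ds.map(preprocessor)  # same dataset + same function: reloaded from the cache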
Awesome! And the SubTask/Task/EvaluationJob would receive a metric parameter? Or something like args_for_task in the glue-suite.json?
Thanks @lvwerra, this is super helpful for understanding your proposal!
model_or_pipeline was mirroring the arg which gets passed to evaluator.compute(), but we can move that out of the subtask and one level up in the suite definition since I think it won't vary at the subtask level. I'm also OK with scrapping JSON in favour of this dataclass implementation — JSON is easy to write but limiting if we want to include Python code (originally I wanted this to be as simple as writing JSON to define the entire suite, but it's becoming clear that won't be possible).
Question: where should data preprocessing live if it's not being used with an evaluation suite, but rather just as an extra preprocessing step before the data gets fed into evaluator.compute? Is the proposal in that case to overwrite evaluator.prepare_data? It's not obvious to me how to tie these together so that it's clear to the user where data preprocessing should go; let me know what you had in mind there!
> - Duplicate work: If we stick to datasets and use ds.map(data_preprocessor) then the processing would be cached by default.
Thanks. =) Yes, we discussed this earlier, and it's good by me.
I do want to clarify, though: the .map function is something we would have in place within our own code, operating on the data_preprocessor that a user provides, and not something that everyone who writes a data_preprocessor would need to call themselves. The latter would be a bit hacky (we'd need instructions that say something like, "For your data preprocessing, please use the .map function..."), whereas if .map lives in our code and is applied to the user's input, then no user has to worry about caching the right way.
Thanks for the feedback and your thoughts. Answering the open questions:
- @NimaBoscarino: Oh indeed, the metric should be an additional attribute of the data class.
- @mathemakitten: I think we can load the dataset in the evaluation suite, apply the preprocessing function there, and pass the processed dataset to the evaluator (which may do additional processing internally); a rough sketch of what I mean follows below. What do you think? I think that also answers your question, @meg-huggingface? So yes, the .map should be internal and the user (or rather whoever defines the eval suite) just passes the preprocessing function.
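A minimal sketch of what that flow could look like inside the suite; the evaluator task type and the metric attribute are assumptions layered on top of the SubTask proposal, not final API:
from datasets import load_dataset
from evaluate import evaluator

def run_subtask(task, model):
    # Load the SubTask's dataset and apply the user-supplied preprocessing
    # function internally; ds.map results are cached across identical runs.
    ds = load_dataset(task.data, task.subset, split=task.split)
    if task.data_preprocessor is not None:
        ds = ds.map(task.data_preprocessor)
    # Hand the processed dataset to an evaluator; "text-classification" and
    # the metric attribute are placeholders for whatever the suite entry defines.
    task_evaluator = evaluator("text-classification")
    return task_evaluator.compute(
        model_or_pipeline=model,
        data=ds,
        metric=getattr(task, "metric", "accuracy"),
    )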
Superseded by #337!