
Feature: standardize inputs/outputs of metrics

Open lvwerra opened this issue 2 years ago • 28 comments

Currently there are several different input/output formats possible in Metrics. We should standardize them as much as possible while respecting the following principles:

  • inputs/outputs are easy to understand and use
  • outputs are compatible with other frameworks

For the output standardization: a dictionary structure, even if nested, would probably be OK. A dedicated output class could also be considered, like in transformer models, but this is probably not necessary here. To make it compatible with e.g. Keras we could add a postprocess function at initialization, similar to a transform in datasets.

There are three options we could implement:

load_metric(..., postprocess="metric_key") # equivalent result to `metric.compute()["metric_key"]`
load_metric(..., postprocess="flatten") # equivalent flattening the output dict: `flatten(metric.compute())`
load_metric(..., postprocess=func) # equivalent result to `func(metric.compute())`
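
A rough sketch of how such a postprocess option could be applied on top of compute() (the helpers below are purely illustrative, not an existing API):

def _flatten(results, parent_key=""):
    # flatten a possibly nested results dict into a single-level dict
    flat = {}
    for key, value in results.items():
        new_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(_flatten(value, new_key))
        else:
            flat[new_key] = value
    return flat

def apply_postprocess(results, postprocess=None):
    # dispatch on the proposed `postprocess` argument of load_metric
    if postprocess is None:
        return results
    if callable(postprocess):            # load_metric(..., postprocess=func)
        return postprocess(results)
    if postprocess == "flatten":         # load_metric(..., postprocess="flatten")
        return _flatten(results)
    return results[postprocess]          # load_metric(..., postprocess="metric_key")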

lvwerra avatar Apr 06 '22 14:04 lvwerra

As per our meeting today, we proposed to have a standardized structure for inputs, in dictionary form.

An initial proposal of that structure can be:


{
    "references": ...,
    "predictions": ...,
}

With references and predictions being of any format (strings, images, numbers, vectors, etc.). I was looking at examples of computer vision metrics and this should work for those as well.

Edge cases:

  • COMET, WikiSplit and SARI -- take an additional input, sources
  • F1, Precision and Recall -- require an average parameter for multiclass labels, but we could define a default if needed
  • Perplexity -- needs an input string and a model

I think we could have additional, optional source and average inputs, but I don't really know what to do for perplexity :sweat_smile: (I mean, in any case, the metrics will not function without these arguments, but I guess waiting for them to crash isn't the best solution)
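
For illustration, the first two edge cases look roughly like this at call time (argument names as in the current metric scripts, the data itself is made up):

sari = load_metric("sari")
sari.compute(
    sources=["About 95 species are currently accepted."],
    predictions=["About 95 you now get in."],
    references=[["About 95 species are currently known."]],
)

f1 = load_metric("f1")
f1.compute(
    predictions=[0, 2, 1, 0],
    references=[0, 1, 2, 0],
    average="macro",  # required once the labels are multiclass
)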

CC @lvwerra @lhoestq @apsdehal

sashavor avatar Apr 08 '22 20:04 sashavor

Just to distinguish the two cases: sources is part of the input data to the metric (one source per sample), and average is a mandatory parameter of the metric in the multilabel setting.

I think it's fine if we allow certain metrics to have extra inputs like sources - but it must be possible to know that in advance, maybe we can provide a property to the Metric object that lists the expected inputs, and extra inputs should have standard names as well.

It's also fine to require certain metric parameters depending on the task like for F1, we can define that when we start to map metrics to tasks.

Regarding perplexity I guess it will depend on how it's used:

  • Is it a tool to evaluate a model based on its predictions? Does it have to fit the other metrics' formats to be used in automated evaluations? In this case it should probably take a model output as metric input, maybe the logits as "predictions"?
  • Is it a measurement tool for data? In this case the current implementation with "input_texts" is the most convenient. (Both usages are sketched below.)
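
A hedged sketch of the two usages (the first call follows the current input_texts implementation, with an assumed model_id argument; the second is purely hypothetical):

perplexity = load_metric("perplexity")

# (a) data measurement: hand over raw text plus a model id, the metric runs the forward passes itself
results = perplexity.compute(
    input_texts=["The quick brown fox jumps over the lazy dog."],
    model_id="gpt2",  # assumed parameter name
)

# (b) model evaluation: the caller already ran the model and passes the logits as "predictions"
#     (hypothetical, not implemented)
# results = perplexity.compute(predictions=logits, references=token_ids)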

lhoestq avatar Apr 11 '22 09:04 lhoestq

Sorry, indeed, I should have distinguished inputs and parameters :)

sashavor avatar Apr 11 '22 13:04 sashavor

Also regarding perplexity: the currently implemented version aims at being a data measurement tool, if I understand correctly - you provide text and a model. Although one could evaluate a model in that setting, it would be less efficient as it requires an additional forward pass. So we could move this version of perplexity to a new folder (e.g. measurements) and load it with load_measurement("perplexity") in the future. Since these tools will measure different modalities, should we call it input_data to make it agnostic?
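
A minimal sketch of what that could look like (load_measurement and input_data are only proposals at this point, not an existing API):

perplexity = load_measurement("perplexity")   # proposed loader for the new `measurements` folder
results = perplexity.compute(
    input_data=["some text to measure"],      # proposed modality-agnostic argument name
    model_id="gpt2",                          # assumed parameter name
)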

lvwerra avatar Apr 12 '22 12:04 lvwerra

Sure, that could work! I mean, having two separate versions of the code seems a bit redundant, but I agree that the way it's implemented now makes it stand out from the other metrics :)

sashavor avatar Apr 12 '22 12:04 sashavor

Postprocessing

IIRC we discussed for the postprocessing step that a more flexible alternative is to add a method to the class such that we could do:

>>> metric = load_metric("bleu")
>>> metric.output_transform(lambda x: x["bleu"])
...
>>> metric.compute()
7.4 # instead of default {"bleu": 7.4, "precisions": 0.4, "brevity_penalty": ...}

Is that what you had in mind @lhoestq?

Input format

@sashavor did you see many metrics that violated the proposed format? I think most of them are in that format since it used to be enforced? Then the next thing is to go one level deeper and look at what's in references and predictions in the individual metrics to make them uniform, too?

lvwerra avatar Apr 12 '22 12:04 lvwerra

The only metrics that violate the proposed format are in the Edge cases in my comment above -- all the other metrics that we currently have are compatible with it.

But yes, the next step would be standardizing the references and predictions format to... a dictionary as well?

sashavor avatar Apr 12 '22 13:04 sashavor

IIRC we discussed for the postprocessing step that a more flexible alternative is to add a method to the class such that we could do:

metric = load_metric("bleu") metric.output_transform(lambda x: x["bleu"]) ... metric.compute() 7.4 # instead of default {"bleu": 7.4, "precisions": 0.4, "brevity_penalty": ...} Is that what you had in mind @lhoestq?

Yup, we can even name the method .postprocess(...) ^^

But yes, the next step would be standardizing the references and predictions format to... a dictionary as well?

We can keep lists of strings or floats directly. Or use dictionaries if we feel like explicit named fields (texts, tokenized_texts, labels) could help - but maybe that would make it less convenient to use.

lhoestq avatar Apr 12 '22 16:04 lhoestq

As per @yjernite's proposal, maybe it makes sense to have two categories of metrics: the ones with references and the ones without references (like perplexity and stuff like n-gram diversity). Do you think that makes sense, @lvwerra and @lhoestq?

sashavor avatar Apr 12 '22 18:04 sashavor

Would these be the metrics that you call measurements? Could we set references=None in these cases so they don't throw an error if a reference is passed? That way they would be easily compatible with the other metrics like BLEU or ROUGE.
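
As a toy illustration of the references=None idea, a referenceless measure such as n-gram diversity could simply accept and ignore references so it stays call-compatible with BLEU or ROUGE (sketch only, not an existing metric script):

def distinct_1(predictions, references=None):
    # `references` is accepted but ignored, so the call signature matches reference-based metrics
    tokens = [token for prediction in predictions for token in prediction.split()]
    return {"distinct_1": len(set(tokens)) / max(len(tokens), 1)}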

lvwerra avatar Apr 13 '22 12:04 lvwerra

Sure, that could work! I think @yjernite preferred the term "referenceless metrics" because they actually measure model performance, but if we implement them with references = None by default, that would work!

sashavor avatar Apr 13 '22 13:04 sashavor

I like referenceless metrics / metrics for unsupervised learning.

In terms of implementation, if possible, I would remove any mention of "references" so as not to make things confusing.

lhoestq avatar Apr 13 '22 13:04 lhoestq

So you would have a different input structure for the referenceless metrics?

sashavor avatar Apr 13 '22 13:04 sashavor

Yes I think it's fine

lhoestq avatar Apr 13 '22 13:04 lhoestq

For perplexity, to fit it into a generic structure and also cover other use cases, would it make sense to have a preprocess step? Something like:

perplexity = load_metric("perplexity")
model = SomeAutoModel()

def transform(x):
    x.update({"logits": model.compute_logits(references)})
    return x
perplexity.set_preprocess_transform(transform)
perplexity.compute(some_reference_list)

apsdehal avatar Apr 13 '22 16:04 apsdehal

A little late to the conversation, but for additional context:

"reference-less" metrics are particularly relevant for NLG evaluation, and probably going to become more common in the near future. Perplexity is one case, n-gram diversity another, and approaches like BOLD try to measure model biases without a reference. They are used to measure model performance, so should really live with the other metrics rather than as data measurements.

I'd recommend looking at the GEM-metrics repo if you haven't already, as Sebastian has put a lot of thought into metrics for NLG there (and I think he'd also be glad if we could make that repo obsolete so he doesn't have to keep maintaining it :stuck_out_tongue_winking_eye: )

In particular: https://github.com/GEM-benchmark/GEM-metrics/blob/main/gem_metrics/metric.py

yjernite avatar Apr 13 '22 21:04 yjernite

From a design perspective, I guess one option would be to have the model prediction as a required input, plus a flexible "extra information" field that may or may not include references, and could also hold the source for text simplification metrics or the supporting paragraph for metrics like QuestEval (which measures whether an answer provided by a QA system is supported by the paragraph).
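
Concretely, such an input structure could look something like this (just a sketch of the idea, the field names are made up):

example = {
    "predictions": ["the model output for each sample"],   # always required
    "extra_info": {                                        # flexible, metric-dependent
        "references": [["one or more gold references"]],
        "sources": ["the source sentence, e.g. for text simplification"],
        "context": ["a supporting paragraph, e.g. for QuestEval"],
    },
}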

wdyt?

yjernite avatar Apr 13 '22 21:04 yjernite

We didn't talk about standardizing outputs much until now -- all metrics currently output dicts, except cer and wer (which output a float) and mauve (which outputs a types.SimpleNamespace, whatever that is). @lhoestq, you mentioned that it's better to output floats instead of dicts, for compatibility with some kind of framework?

sashavor avatar May 02 '22 20:05 sashavor

Keras takes a list of "keras metrics" (i.e. callables with input tensors y_true, y_pred and one float as output) for evaluation. But it seems that using something other than Keras internals to compute the metrics is not recommended: https://github.com/keras-team/keras/issues/9640

Alternatively one can use "keras callbacks". There is a nice example that shows how to use callbacks to use a scikit-learn metric in Keras: https://keras.io/examples/keras_recipes/sklearn_metric_callbacks/ - though it's some boilerplate code, maybe we can do something about it.

I think we can use metrics that return a dictionary with keras callbacks - this way Keras users can use any metric in our catalog. Keras callbacks can also be used if we return only a float.

Though since our metrics don't subclass the keras callback class, we might need to add a way to convert our metrics to keras callbacks.
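
A rough sketch of what such a conversion could look like, following the pattern of the Keras recipe linked above (the class name and constructor arguments are made up):

import numpy as np
import tensorflow as tf
import evaluate

class EvaluateMetricCallback(tf.keras.callbacks.Callback):
    """Reports an `evaluate` metric on a validation set at the end of each epoch."""

    def __init__(self, metric_name, x_val, y_val):
        super().__init__()
        self.metric = evaluate.load(metric_name)
        self.x_val = x_val
        self.y_val = y_val

    def on_epoch_end(self, epoch, logs=None):
        # run the model on the validation data and turn logits into class ids
        predictions = np.argmax(self.model.predict(self.x_val), axis=-1)
        results = self.metric.compute(predictions=predictions, references=self.y_val)
        # merge the (possibly multi-valued) metric dict into the Keras logs
        if logs is not None:
            logs.update(results)

# usage: model.fit(x_train, y_train, callbacks=[EvaluateMetricCallback("accuracy", x_val, y_val)])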


Returning a single float is still interesting in some cases. For example it makes things less redundant:

  • currently one has to call load_metric("accuracy").compute()["accuracy"] to get the result, while it's obvious they want the "accuracy" field of the output.
  • because in general, metrics can have several outputs in the dictionary, if one wants to use evaluate and write a report in YAML, they would have to write things like
results:
- accuracy:
  - accuracy: 0.99

Therefore I would actually explore if it's feasible to split the metrics so that they return a single value when possible.

The metrics that don't return a single value can be considered "multi-output" metrics and have to be treated differently in reports (maybe using a nested approach as presented in the YAML example above).

Would love your comments and ideas on this subject !

lhoestq avatar May 03 '22 17:05 lhoestq

I have been thinking about the scalar vs. dict question. Having a dict across all metrics, at least internally, is nice as it allows us to treat them all the same way, and we can also combine metrics by merging dicts. At the same time we could check if the return is just a dict with one value and, if that's the case, just return its value.

metric = evaluate.load("accuracy") 
>>> 0.6

metric = evaluate.load("accuracy",  force_dict=True)
>>> {"accuracy": 0.6}

What do you think @lhoestq?

Regarding Keras, I'll think a bit more about how to do that smoothly.

lvwerra avatar May 19 '22 07:05 lvwerra

I think the issue was combining metrics too, no? I.e. if I run a bunch of metrics and I want to transmit the results somewhere, it's easier to already have them in a dict. Why not make the force_dict version the default one and then add a keras (or scalar) flag as an alternative?

sashavor avatar May 19 '22 12:05 sashavor

I think @lhoestq's point was that for many metrics the user might just want a single value but always needs to access the dict: load_metric("accuracy").compute()["accuracy"].

When we combine metrics we can also check if the return type is a dictionary or a scalar and make sure we combine them correctly.

lvwerra avatar May 19 '22 15:05 lvwerra

Ok, that works too!

sashavor avatar May 19 '22 16:05 sashavor

I think I would expect this code to run:

acc = load("accuracy")
f1 = load("f1")
metrics = combine_metrics([acc, f1])
output = metrics.compute(...)
print(output["accuracy"])
print(output["f1"])

without having to specify force_dict, no?

Maybe combine_metrics can accept single-output metrics as well as multi-output metrics. For single-value metrics it would take the metric name as a key, and for multi-output metrics it would take the output keys as is.

Another option would be to always take the metric name as key.

For example for accuracy and seqeval, instead of having

output["accuracy"], output["macro avg"]

we could have

output["accuracy"], output["seqeval"]["macro avg"]

Let's discuss this together tomorrow if you want?

lhoestq avatar May 30 '22 16:05 lhoestq

Any updates on this? I'm running into the headache that, indeed, I need to write my own metric-specific edge cases depending on the expected input. If there is a consensus about how to move forward with this, I can work on it relatively elaborately!

Also note that existing metrics may need to be checked. Some of them force type transformations themselves, which I think we want to avoid or make part of the library rather than a metric-level requirement.

BramVanroy avatar Aug 11 '22 12:08 BramVanroy

We're faced with this lack of standardisation too. In our package we're working with metrics in a generic fashion, and not having a uniform output format of the metrics is quite frustrating. We can hack our way out of it by checking for the dtype of the outputs, but it would be much preferable if they all returned the same format.

Given that some metrics inherently output multiple values (such as squad_v2), I'd argue that a dict would make the most sense. This would be a simple (but breaking) change, changing cer, wer and mauve to return dicts.

I can create a quick PR fix if that would be alright, @lvwerra?

saattrupdan avatar Oct 27 '22 11:10 saattrupdan

We tried to make dicts for most metrics, but it's possible we missed those. Yes, feel free to open a PR, but we should also check that some of the popular libraries (e.g. transformers) using evaluate will not break if we merge this.

Two things we can do:

  1. search for evaluate.load("cer") on GitHub to see how the metric is currently used
  2. make the returned type dependent on evaluate.__version__ and only return a dict for the new release, so that existing installations don't suddenly break (a rough sketch is below)
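
A sketch of option 2 inside a metric script (the version cutoff is hypothetical):

from packaging import version
import evaluate

# only the new release returns a dict; older installations keep the bare float
_RETURNS_DICT = version.parse(evaluate.__version__) >= version.parse("0.4.0")  # made-up cutoff

def _format_cer_output(score):
    return {"cer": score} if _RETURNS_DICT else score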

Do you want to have a stab at it?

lvwerra avatar Nov 17 '22 11:11 lvwerra

@lvwerra Hello. Can someone help with https://github.com/huggingface/evaluate/issues/516 ?

I am trying to combine two metrics within an ASR task, and it seems that my problem is related to the topic discussed here. I am using a notebook from @sanchit-gandhi to fine-tune Whisper, but if I use the approach of combining the WER and CER metrics, everything breaks down. The Trainer does not get what it needs from the compute_metrics function. It seems that the problem is that the metrics return not a dictionary but a value...

blademoon avatar Nov 20 '23 13:11 blademoon