
CharacTER: MT metric

Open BramVanroy opened this pull request 2 years ago • 6 comments

Add the CharacTER MT evaluation metric that was introduced in "CharacTer: Translation Edit Rate on Character Level" (Wang et al., 2016).

Specifically, this implementation uses the repackaged version of the original for usability reasons.

BramVanroy avatar Sep 02 '22 13:09 BramVanroy
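For context, a minimal usage sketch of how the metric could be called once merged. The "character" identifier is an assumption based on the metrics/character directory used in this PR, and the returned keys depend on the final implementation.

```python
# Hypothetical usage sketch; "character" is assumed from the metrics/character
# directory in this PR and may not be the final identifier on the Hub.
import evaluate

character = evaluate.load("character")
results = character.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat is sitting on the mat"],
)
print(results)  # the exact keys depend on the final implementation
```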

The documentation is not available anymore as the PR was closed or merged.

Experiencing some isort issues like WARNING: Unable to parse file tests due to [Errno 13] Permission denied: 'C:\\Python\\projects\\evaluate\\tests'. Will investigate later.

BramVanroy avatar Sep 02 '22 14:09 BramVanroy

Thanks @BramVanroy, this is cool! Will add to my list to review mid-next week :)

mathemakitten avatar Sep 03 '22 02:09 mathemakitten

@mathemakitten I fixed the isort issues.

BramVanroy avatar Sep 11 '22 09:09 BramVanroy

Hi @BramVanroy, thanks so much for putting this together! I added some comments and questions, please have a look and let me know if I can clarify anything 🤗

Thanks for having such a detailed look @mathemakitten! I incorporated your suggestions. I am running into an issue with the required input format again; I have seen this behavior before and I am not sure how to solve it.

It seems that when I pass a single sentence (not a list), the string is treated as a sequence of characters, which leads to the following error:

ValueError: Mismatch in the number of predictions (71) and references (82)

You can trigger this error by running python -m pytest .\metrics\character\tests.py.

When running compute, what kind of format is expected? I tried adding a batch dimension, but that will pass the list to compute, which in turn will trigger cer.calculate_cer_corpus instead of calculate_cer. So it seems I can never "just" pass a string? Please advise me on that.

BramVanroy avatar Sep 15 '22 11:09 BramVanroy

> When running compute, what kind of format is expected? I tried adding a batch dimension, but that will pass the list to compute, which in turn will trigger cer.calculate_cer_corpus instead of calculate_cer. So it seems I can never "just" pass a string? Please advise me on that.

So compute always expects a list of samples. If your feature consists of Value("string"), you should pass the following:

metric.compute(predictions=["my example"], references=["your example"])

If you only want to pass a single example, you can use add:

metric.add(predictions="my example", references="your example")

Finally, if your features themselves are lists (e.g. tokenized strings), you should pass them as follows:

metric.compute(predictions=[["my", "example"]], references=[["your", "example"]])

Does that make sense? In your case it indeed seems that the string is being interpreted as a list of characters, which might be causing the mismatch, so I am guessing you passed the add format above to compute. If not, could you share an example?

lvwerra avatar Sep 21 '22 06:09 lvwerra
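To make the calling conventions above concrete, here is a small sketch against the metric from this PR; the "character" identifier is again an assumption, but the compute() conventions are the general ones described in the previous comment.

```python
# Sketch of the compute() conventions described above. The "character" identifier
# is assumed from the metrics/character directory in this PR.
import evaluate

metric = evaluate.load("character")

# A single sentence pair still goes in as one-element lists:
single = metric.compute(predictions=["my example"], references=["your example"])

# A corpus is just longer lists, one entry per segment:
corpus = metric.compute(
    predictions=["my example", "a second sentence"],
    references=["your example", "a second reference"],
)

# Passing bare strings instead of lists makes each string look like a sequence of
# characters, which is the kind of count mismatch reported in the error above.
```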

> Looks good. I wonder if it would make sense to extend it to the case where you have multiple references? E.g. take the minimum in that case, which corresponds to the score against the most similar reference? This would make it easy to combine it with all the other text metrics. What do you think?

Sure, I can do that. Should it also return the most similar reference in that case, or just the score?

BramVanroy avatar Dec 06 '22 14:12 BramVanroy
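For illustration, a sketch of the multi-reference idea being discussed: score the prediction against each reference and keep the minimum, since CharacTER is an edit rate and lower is better. The calculate_cer signature (tokenized hypothesis and reference in, a float out) is an assumption about the underlying cer package.

```python
# Sketch only: calculate_cer's exact signature is assumed (tokenized hypothesis
# and reference in, float out); check the cer package before relying on this.
from typing import List

from cer import calculate_cer


def best_reference_score(prediction: str, references: List[str]) -> float:
    """Return the CharacTER score against the most similar reference (the minimum)."""
    pred_tokens = prediction.split()
    return min(calculate_cer(pred_tokens, reference.split()) for reference in references)
```

Returning only the minimum keeps the output shape identical to the single-reference case.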

> Sure, I can do that. Should it also return the most similar reference in that case, or just the score?

Just the score would be enough. Looking at the code of the underlying library, I noticed that you are returning individual scores. To keep the metric in line with the other metrics, it would be great if the primary returned value were a scalar, not a list of scores per example. It's ok to have it as a secondary score; even better would be to have an aggregate=True keyword and only return all the scores when it is set to False. That keeps the user experience leaner.

Both points (generalizing to multiple refs and returning only a scalar) also apply to #290.

lvwerra avatar Dec 06 '22 15:12 lvwerra

>> Sure, I can do that. Should it also return the most similar reference in that case, or just the score?
>
> Just the score would be enough. Looking at the code of the underlying library, I noticed that you are returning individual scores. To keep the metric in line with the other metrics, it would be great if the primary returned value were a scalar, not a list of scores per example. It's ok to have it as a secondary score; even better would be to have an aggregate=True keyword and only return all the scores when it is set to False. That keeps the user experience leaner.
>
> Both points (generalizing to multiple refs and returning only a scalar) also apply to #290.

Okay. Is there a canonical way to aggregate when using aggregate=True? Suggestion: add a keyword argument aggregate that has the following options:

  • True/"mean": returns the average (default)
  • "sum": returns the sum
  • "median": returns the median

And a secondary kwarg return_all_scores that optionally returns all scores as a list. Default=False.

What do you think?

BramVanroy avatar Dec 06 '22 15:12 BramVanroy
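A sketch of the aggregation proposal above; the helper name and result keys are illustrative only. Per-example scores stay available, but the primary value is always a scalar.

```python
# Hypothetical helper illustrating the proposed aggregate / return_all_scores kwargs;
# the function name and result keys are illustrative, not the final PR code.
import statistics
from typing import Dict, List, Union


def aggregate_scores(
    scores: List[float],
    aggregate: Union[bool, str] = "mean",
    return_all_scores: bool = False,
) -> Dict[str, Union[float, List[float]]]:
    if aggregate in (True, "mean"):
        value = statistics.mean(scores)
    elif aggregate == "sum":
        value = sum(scores)
    elif aggregate == "median":
        value = statistics.median(scores)
    else:
        raise ValueError(f"Unsupported aggregate option: {aggregate!r}")

    result: Dict[str, Union[float, List[float]]] = {"cer_score": value}
    if return_all_scores:
        result["cer_scores"] = scores
    return result
```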

I implemented my suggestion above. Please let me know if that is in line with what you expected @lvwerra, so I can update the PR in charcut in a similar way.

BramVanroy avatar Dec 07 '22 14:12 BramVanroy

Looks like Literal was added only in Python 3.8 (we are still supporting 3.7). Could you fix this? If the CI is green we can merge :)

lvwerra avatar Dec 08 '22 09:12 lvwerra

Done!

BramVanroy avatar Dec 08 '22 10:12 BramVanroy
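For reference, one common way to keep a Literal annotation working on Python 3.7 (not necessarily the fix applied here) is to fall back to the typing_extensions backport; alternatively, the annotation can simply be relaxed to str with runtime validation.

```python
# Not necessarily the fix used in this PR; typing.Literal only exists on 3.8+,
# so fall back to the typing_extensions backport on older interpreters.
import sys

if sys.version_info >= (3, 8):
    from typing import Literal
else:
    from typing_extensions import Literal  # extra dependency on Python 3.7

AggregateOption = Literal["mean", "sum", "median"]
```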

I saw that there are some suggestions from @mathemakitten left, would you mind adding them?

lvwerra avatar Dec 08 '22 11:12 lvwerra

> I saw that there are some suggestions from @mathemakitten left, would you mind adding them?

Done.

BramVanroy avatar Dec 08 '22 12:12 BramVanroy