evaluate
CharacTER: MT metric
Add the CharacTER MT evaluation metric that was introduced in *CharacTer: Translation Edit Rate on Character Level*.
Specifically, this implementation uses the repackaged version of the original for usability reasons.
Experiencing some isort issues, such as `WARNING: Unable to parse file tests due to [Errno 13] Permission denied: 'C:\\Python\\projects\\evaluate\\tests'`. Will investigate later.
Thanks @BramVanroy, this is cool! Will add to my list to review mid-next week :)
@mathemakitten I fixed the isort issues.
Hi @BramVanroy, thanks so much for putting this together! I added some comments and questions, please have a look and let me know if I can clarify anything 🤗
Thanks for having such a detailed look @mathemakitten! I incorporated your suggestions. However, I am running into an issue with the required input format again; I have seen this behavior before and I am not sure how to solve it.
It seems that when I pass a single sentence (not a list), the string is treated as a sequence of characters, which leads to errors like:

`ValueError: Mismatch in the number of predictions (71) and references (82)`

You can trigger this error by running `python -m pytest .\metrics\character\tests.py`.
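For illustration, this is roughly what goes wrong when a bare string is iterated like a list (plain Python, nothing evaluate-specific):

```python
# A bare string iterates character by character, so "predictions" and
# "references" end up being counted in characters rather than sentences,
# and two sentences of different length no longer line up.
prediction = "the cat sat on the mat"
reference = "a cat was sitting on the mat"
print(len(list(prediction)), len(list(reference)))  # 22 28 -> length mismatch
```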
When running `compute`, what kind of format is expected? I tried adding a batch dimension, but that will pass the list to `compute`, which in turn will trigger `cer.calculate_cer_corpus` instead of `calculate_cer`. So it seems I can never "just" pass a string? Please advise me on that.
So `compute` always expects a list of samples. If you want to use a feature that consists of `Value("string")`, you should pass the following:
```python
metric.compute(predictions=["my example"], references=["your example"])
```
If you only want to pass a single example, you can use `add`:
```python
metric.add(predictions="my example", references="your example")
```
Finally, if your features themselves are lists (e.g. tokenized strings), you should pass them as follows:
```python
metric.compute(predictions=[["my", "example"]], references=[["your", "example"]])
```
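Putting that together, a minimal end-to-end sketch (assuming the metric ends up loadable under the id "character", based on the metrics/character/ folder in this PR):

```python
import evaluate

# Hypothetical metric id; taken from the folder name in this PR.
character = evaluate.load("character")

# compute() always takes lists of samples, even for a single sentence pair:
print(character.compute(predictions=["my example"], references=["your example"]))

# For incremental use, add one example at a time and call compute() at the end:
character.add(predictions="my example", references="your example")
print(character.compute())
```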
Does that make sense? In your case it indeed seems that the string is interpreted as a list, and that might be causing the mismatch, so I am guessing you passed the `add` format above to `compute`. If not, could you share an example?
Looks good. I wonder if it would make sense to extend it to the case where you have multiple references? E.g. take the minimum in that case, which corresponds to the score of the most similar reference. This would make it easy to combine it with all the other text metrics. What do you think?
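For instance, something along these lines (a sketch only; it assumes the per-sentence `calculate_cer` function mentioned above takes whitespace-tokenized hypothesis and reference lists, which I have not verified):

```python
from cer import calculate_cer  # per-sentence scorer referenced above

def multi_reference_score(prediction: str, references: list) -> float:
    # CharacTER is an edit rate (lower is better), so the most similar
    # reference corresponds to the minimum score. Whitespace tokenization
    # and the calculate_cer signature are assumptions for illustration.
    hyp_tokens = prediction.split()
    return min(calculate_cer(hyp_tokens, ref.split()) for ref in references)
```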
Sure, I can do that. Should it also return the most similar reference in that case, or just the score?
Just the score would be enough. Looking at the code of the underlying library, I noticed that you are returning individual scores. To keep the metric in line with the other metrics, it would be great if the primary returned value were a scalar rather than a list of scores per example. It's ok to have the per-example scores as a secondary output; even better would be an `aggregate=True` keyword that only returns all the scores when set to `False`. That keeps the user experience leaner.
Both points (generalizing to multiple refs and returning only a scalar) also apply to #290.
Okay. Is there a canonical way to aggregate when using `aggregate=True`? Suggestion: add a keyword argument `aggregate` with the following options:
- True/"mean": returns the average (default)
- "sum": returns the sum
- "median": returns the median

And a secondary kwarg `return_all_scores` that optionally returns all scores as a list (default: False). What do you think?
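A rough sketch of that aggregation logic (the `aggregate` and `return_all_scores` names follow the suggestion above; the result key names are placeholders and nothing here is final):

```python
import statistics

def aggregate_scores(scores, aggregate="mean", return_all_scores=False):
    # Reduce the per-example CharacTER scores to the requested summary value.
    if aggregate in (True, "mean"):
        result = {"cer_score": statistics.mean(scores)}
    elif aggregate == "sum":
        result = {"cer_score": sum(scores)}
    elif aggregate == "median":
        result = {"cer_score": statistics.median(scores)}
    else:
        raise ValueError(f"Unsupported aggregate option: {aggregate!r}")
    if return_all_scores:
        result["cer_scores"] = scores
    return result
```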
I implemented my suggestion above. Please let me know if that is in line with what you expected @lvwerra, so I can update the PR in charcut in a similar way.
Looks like `Literal` was added only in Python 3.8 (we are still supporting 3.7). Could you fix this? If the CI is green we can merge :)
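One common way to keep `Literal` while still supporting 3.7 is a version-gated import (this assumes `typing_extensions` is acceptable as a dependency; plain strings plus a runtime check would work just as well):

```python
import sys

# Literal lives in typing only from Python 3.8 onwards;
# typing_extensions backports it for 3.7.
if sys.version_info >= (3, 8):
    from typing import Literal
else:
    from typing_extensions import Literal

AGGREGATE_OPTIONS = Literal["mean", "sum", "median"]
```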
Done!
I saw that there are some suggestions from @mathemakitten left, would you mind adding them?
Done.