Standard metrics
Codecov Report
Attention: Patch coverage is 59.91561%, with 95 lines in your changes missing coverage. Please review.
Project coverage is 89.05%. Comparing base (cdf0348) to head (91805ce).
| Files | Patch % | Lines |
|---|---|---|
| src/unitxt/standard_metrics.py | 58.03% | 94 Missing :warning: |
| src/unitxt/operators.py | 88.88% | 1 Missing :warning: |
Additional details and impacted files
Coverage Diff:
| | main | #658 | +/- |
|---|---|---|---|
| Coverage | 89.83% | 89.05% | -0.78% |
| Files | 96 | 97 | +1 |
| Lines | 9118 | 9350 | +232 |
| Hits | 8191 | 8327 | +136 |
| Misses | 927 | 1023 | +96 |
@dafnapension @elronbandel - Can you explain the motivation for this PR? What are standard metrics and how do they relate to the existing metrics?
Current evaluation of a global metric starts by laying out the whole stream in main memory and adding "next to it" a few hundred copies of it (for the re-samplings). This breaks the 'streaming' spirit of unitxt. We are trying to see whether global metrics can also be streamed. To this end, we implement the following for each global metric: (1) an instance scorer that scores each individual instance (like today's); (2) an accumulator that accumulates what it needs from each instance, without copying the whole instance: e.g., F1 accumulates the confusion matrix (a count of occurrences of each (ref, pred) pair) over all instances, a counter expected to be dramatically smaller than the whole evaluated stream; (3) a function yielding the final global score from the accumulated value.
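To make the accumulator idea concrete, here is a minimal sketch in plain Python (the class `StreamingF1` and its methods are illustrative, not unitxt's actual API): step (2) maintains only a (reference, prediction) counter, and step (3) derives macro F1 from it.

```python
from collections import Counter

class StreamingF1:
    """Illustrative accumulator-style global metric: macro F1 over a stream.

    Rather than materializing the stream, we keep only a confusion-matrix
    counter of (reference, prediction) pairs; its size is bounded by the
    number of distinct labels squared, not by the stream length.
    """

    def __init__(self):
        self.confusion = Counter()  # (reference, prediction) -> count

    def update(self, reference, prediction):
        # (2) accumulate: one counter increment per instance, no instance copies
        self.confusion[(reference, prediction)] += 1

    def compute(self):
        # (3) finalize: derive macro F1 from the accumulated confusion matrix
        labels = {r for r, _ in self.confusion} | {p for _, p in self.confusion}
        f1_per_label = []
        for label in labels:
            tp = self.confusion[(label, label)]
            fp = sum(c for (r, p), c in self.confusion.items() if p == label and r != label)
            fn = sum(c for (r, p), c in self.confusion.items() if r == label and p != label)
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1_per_label.append(
                2 * precision * recall / (precision + recall) if precision + recall else 0.0
            )
        return sum(f1_per_label) / len(f1_per_label) if f1_per_label else 0.0
```

The memory footprint stays O(|labels|^2) no matter how many instances flow through, which is the whole point of the accumulator step.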
The resampling is somewhat trickier. Today, we generate a single resample by selecting, with replacement, n instances from a stream of length n, and we repeat that process once per desired resample. This process does not suit streaming. So we suggest: given an instance i, for each resample r (which we want to learn from without first building it), we randomly pick the number of times b that i participates in r. A Poisson(1) distribution for picking b is exactly what we need here: under the usual selection with replacement, each instance's multiplicity follows Binomial(n, 1/n), which Poisson(1) closely approximates.
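A sketch of that Poisson-bootstrap idea under the same caveat (illustrative names, not unitxt code): the stream is traversed once, each replicate keeps its own small accumulator (here, for simplicity, a confusion counter finalized as accuracy), and no resample is ever materialized.

```python
from collections import Counter
import numpy as np

def poisson_bootstrap_scores(stream, n_resamples=100, seed=0):
    """One pass over (reference, prediction) pairs, one small accumulator
    per bootstrap replicate.

    For each instance and each replicate r, draw b ~ Poisson(1): the number
    of times the instance participates in r. Poisson(1) closely approximates
    Binomial(n, 1/n), the per-instance multiplicity induced by drawing n
    items with replacement from a stream of length n.
    """
    rng = np.random.default_rng(seed)
    replicates = [Counter() for _ in range(n_resamples)]
    for reference, prediction in stream:
        for replicate, b in zip(replicates, rng.poisson(1.0, size=n_resamples)):
            if b:
                replicate[(reference, prediction)] += b
    # Finalize each replicate with the metric's global-score function;
    # accuracy over the accumulated counts is used here for brevity.
    scores = []
    for replicate in replicates:
        total = sum(replicate.values())
        correct = sum(c for (r, p), c in replicate.items() if r == p)
        scores.append(correct / total if total else 0.0)
    return scores  # e.g., np.percentile(scores, [2.5, 97.5]) for a 95% CI
```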
@elronbandel @dafnapension - I'm sure you have discussed this between yourselves a lot, but I want to offer a different perspective.
I think streaming in unitxt may be useful if unitxt is used for large-scale training; however, it also carries a significant cost in terms of code and API complexity. In evaluation, where typically only hundreds of samples are tested, streaming will add no significant value.
We need metrics that are (1) independent of each other and (2) easy for users to add AND debug.
Our direction should be toward simplification, not toward making things more complex.
Therefore, I think it is worth discussing whether this direction will bring a net gain in terms of unitxt adoption.
(@eladven - would be glad to have your input as well).
Leaving this for now. If continued at all, it will be via https://github.com/IBM/unitxt/pull/845