Standard metrics
Codecov Report
Attention: Patch coverage is 59.91561%, with 95 lines in your changes missing coverage. Please review.
Project coverage is 89.05%. Comparing base (cdf0348) to head (91805ce).
| Files | Patch % | Lines |
|---|---|---|
| src/unitxt/standard_metrics.py | 58.03% | 94 Missing :warning: |
| src/unitxt/operators.py | 88.88% | 1 Missing :warning: |
Additional details and impacted files
Coverage Diff:
| | main | #658 | +/- |
|---|---|---|---|
| Coverage | 89.83% | 89.05% | -0.78% |
| Files | 96 | 97 | +1 |
| Lines | 9118 | 9350 | +232 |
| Hits | 8191 | 8327 | +136 |
| Misses | 927 | 1023 | +96 |
@dafnapension @elronbandel - Can you explain the motivation for this PR? What are standard metrics and how do they relate to the existing metrics?
Current evaluation of a global metric starts by laying out the whole stream in main memory and adding "next to it" a few hundred copies of it (for the re-samplings). This breaks the 'streaming' spirit of unitxt. We are trying to see whether global metrics can also be streamed. To this end, we implement the following for each global metric: (1) an instance scorer that scores each individual instance (like today's); (2) an accumulator that accumulates what it needs from each instance, without copying the whole instance: e.g., F1 accumulates the confusion matrix (a count of occurrences of each (ref, pred) pair) over all instances, a counter expected to be dramatically smaller than the whole evaluated stream; (3) a function yielding the final global score from the accumulated value.
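To make the accumulator idea concrete, here is a minimal sketch in plain Python (the class `StreamingF1` and its methods are illustrative, not unitxt's actual API): step (2) maintains only a (reference, prediction) counter, and step (3) derives macro F1 from it.

```python
from collections import Counter

class StreamingF1:
    """Illustrative accumulator-style global metric: macro F1 over a stream.

    Rather than materializing the stream, we keep only a confusion-matrix
    counter of (reference, prediction) pairs; its size is bounded by the
    number of distinct labels squared, not by the stream length.
    """

    def __init__(self):
        self.confusion = Counter()  # (reference, prediction) -> count

    def update(self, reference, prediction):
        # (2) accumulate: one counter increment per instance, no instance copies
        self.confusion[(reference, prediction)] += 1

    def compute(self):
        # (3) finalize: derive macro F1 from the accumulated confusion matrix
        labels = {r for r, _ in self.confusion} | {p for _, p in self.confusion}
        f1_per_label = []
        for label in labels:
            tp = self.confusion[(label, label)]
            fp = sum(c for (r, p), c in self.confusion.items() if p == label and r != label)
            fn = sum(c for (r, p), c in self.confusion.items() if r == label and p != label)
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1_per_label.append(
                2 * precision * recall / (precision + recall) if precision + recall else 0.0
            )
        return sum(f1_per_label) / len(f1_per_label) if f1_per_label else 0.0
```

The memory footprint stays O(|labels|^2) no matter how many instances flow through, which is the whole point of the accumulator step.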
The resampling is somewhat trickier. Today, we generate a single resample by selecting, with replacement, n instances from a stream of length n, and we repeat that process once per desired resample. This process does not suit streaming. So we suggest: given an instance i, for each resample r (which we want to learn from without first building it), we randomly pick the number of times b that i participates in r. A Poisson(1) distribution for picking b is exactly what we need here: under the usual selection with replacement, each instance's multiplicity follows Binomial(n, 1/n), which Poisson(1) closely approximates.
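A sketch of that Poisson-bootstrap idea under the same caveat (illustrative names, not unitxt code): the stream is traversed once, each replicate keeps its own small accumulator (here, for simplicity, a confusion counter finalized as accuracy), and no resample is ever materialized.

```python
from collections import Counter
import numpy as np

def poisson_bootstrap_scores(stream, n_resamples=100, seed=0):
    """One pass over (reference, prediction) pairs, one small accumulator
    per bootstrap replicate.

    For each instance and each replicate r, draw b ~ Poisson(1): the number
    of times the instance participates in r. Poisson(1) closely approximates
    Binomial(n, 1/n), the per-instance multiplicity induced by drawing n
    items with replacement from a stream of length n.
    """
    rng = np.random.default_rng(seed)
    replicates = [Counter() for _ in range(n_resamples)]
    for reference, prediction in stream:
        for replicate, b in zip(replicates, rng.poisson(1.0, size=n_resamples)):
            if b:
                replicate[(reference, prediction)] += b
    # Finalize each replicate with the metric's global-score function;
    # accuracy over the accumulated counts is used here for brevity.
    scores = []
    for replicate in replicates:
        total = sum(replicate.values())
        correct = sum(c for (r, p), c in replicate.items() if r == p)
        scores.append(correct / total if total else 0.0)
    return scores  # e.g., np.percentile(scores, [2.5, 97.5]) for a 95% CI
```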
@elronbandel @dafnapension - I'm sure you have discussed this between yourselves a lot, but I want to offer a different perspective.
I think streaming in unitxt may be useful if unitxt is used for large-scale training; however, it also carries a significant cost in terms of code and API complexity. In evaluation, where typically only hundreds of samples are tested, streaming will add no significant value.
We need metrics that are (1) independent of each other and (2) easy for users to add AND debug.
Our direction should be toward simplification, not toward making things more complex.
Therefore, I think it is worth discussing whether this direction will bring a net gain in terms of unitxt adoption.
(@eladven - would be glad to have your input as well).
Leaving this for now. If continued at all, it will be via https://github.com/IBM/unitxt/pull/845