Yoav Katz
Hi Dafna. Thanks. Instead of calling the hf inference engine, you can just copy the "target" of instance i to the prediction of instance i+1. This will simulate a model,...
See example here https://github.com/IBM/unitxt/blob/c2fc7ab4caeac1e48d523a34cc34a0cdcc597d16/examples/evaluate_llm_as_judge.py#L43
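A minimal sketch of the shifting idea, independent of the linked example. The helper name `shift_targets_to_predictions` is hypothetical, and wrapping the last target around to instance 0 is an assumption, since the comment does not say what the first instance should receive:

```python
def shift_targets_to_predictions(targets):
    """Use the target of instance i as the prediction of instance i+1.

    Assumption: instance 0 receives the last target (wrap-around),
    so every instance has a prediction.
    """
    if not targets:
        return []
    return [targets[-1]] + targets[:-1]


# Hypothetical targets, for illustration only.
targets = ["a cat", "a dog", "a bird"]
predictions = shift_targets_to_predictions(targets)

for target, prediction in zip(targets, predictions):
    print(f"target={target!r} prediction={prediction!r}")
```

Because each prediction is a plausible but mostly wrong answer, this gives non-trivial scores without running an inference engine.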
Hi Dafna. Can you look at all the instance scores and not only the first? Perhaps there is one instance with a big difference that affects the whole average....
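One way to check this, sketched in plain Python: compare the two lists of per-instance scores and rank instances by the size of the gap. The function name `largest_score_gaps` and the score values are hypothetical; how the per-instance scores are extracted from each run is not shown here.

```python
def largest_score_gaps(scores_a, scores_b, top_k=3):
    """Return (index, difference) pairs with the largest absolute gaps."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    gaps = [(i, a - b) for i, (a, b) in enumerate(zip(scores_a, scores_b))]
    # Largest absolute difference first.
    gaps.sort(key=lambda gap: abs(gap[1]), reverse=True)
    return gaps[:top_k]


# Hypothetical per-instance scores from two implementations of a metric.
scores_a = [0.5, 0.75, 0.25, 0.5]
scores_b = [0.5, 0.25, 0.25, 0.5]

for index, diff in largest_score_gaps(scores_a, scores_b, top_k=2):
    print(f"instance {index}: difference {diff:+.4f}")
```

If a single instance dominates the list, the difference in the average comes from that instance rather than from a systematic offset.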
> The difference you are looking at (which you bolded in [#1078 (comment)](https://github.com/IBM/unitxt/issues/1078#issuecomment-2257871344)) is not 0.7, it is 0.0007. You are right (I meant 0.07 points, which is 0.0007)...
This is the rouge code: https://huggingface.co/spaces/evaluate-metric/rouge/blob/e2671c0764b07f287918af2338dfbd162c14cd07/rouge.py#L121
Now I understand. Thank you. I also talked with Elron. Since people are used to the HF Rouge score, we need to be comparable with it. One way to do...
The second option may be simpler: Just change the code here: What do you think?
We want to make it simple and backward compatible. Later we can change. So we suggest 1) have a flag in metric `override_score_with_ci_mid` which will now only be...
Yes. That's the default there.
I'm not sure there is any bug in the code. I believe the problem is that when a single metric calculation of global metrics takes a few seconds, it...