Possible Bug in calculate_score.py, empty responses or extractions results in non-empty normalized extraction due to `get_most_similar`

Open mattmazzola opened this issue 1 year ago • 0 comments

I was debugging an issue where our model output empty responses for every question, and noticed the accuracy score was still 22% when I expected 0%.

Debugging further, I found that for multi_choice questions there is a code path that computes the Levenshtein distance but does not guard against empty inputs, so it returns a valid choice regardless. (Likely the choice with the fewest characters, since that has the minimum edit distance from an empty string, or the first choice if all are the same length.)

https://github.com/lupantech/MathVista/blob/82f68d09b4cbffe9d0dfd7542c599810e30c9a99/evaluation/calculate_score.py#L45-L51

https://github.com/lupantech/MathVista/blob/82f68d09b4cbffe9d0dfd7542c599810e30c9a99/evaluation/calculate_score.py#L14-L20
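
To illustrate the failure mode, here is a minimal, self-contained sketch. It is not the code from `calculate_score.py`: the function names `levenshtein`, `closest_choice_unguarded`, and `closest_choice_guarded` and the sample choices are hypothetical. It assumes only that the closest choice is picked by minimum edit distance, as described above.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def closest_choice_unguarded(prediction: str, choices: Sequence[str]) -> str:
    # min() always returns some choice, even when the prediction is empty.
    return min(choices, key=lambda c: levenshtein(prediction, c))


def closest_choice_guarded(prediction: str, choices: Sequence[str]) -> Optional[str]:
    # Hypothetical guard: treat an empty or whitespace-only prediction as "no answer".
    if not prediction or not prediction.strip():
        return None
    return min(choices, key=lambda c: levenshtein(prediction, c))


choices = ["parallel", "acute", "obtuse"]  # made-up example choices
print(closest_choice_unguarded("", choices))  # an empty prediction still selects a choice
print(closest_choice_guarded("", choices))    # the guard reports no answer instead
```

The distance from an empty string to any choice is just that choice's length, so the unguarded version picks the shortest choice, and `min` breaks ties in favor of the first one. With a guard like the one sketched here, an empty response would be scored as incorrect rather than matched to a choice.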

I also saw some questionable exception handling when coercing the input value to a string. It assigns an empty string and continues. I think it should exit early and return None. Assigning an empty string could further contribute to the issue above for multiple-choice problems where the extraction is not a string.

https://github.com/lupantech/MathVista/blob/82f68d09b4cbffe9d0dfd7542c599810e30c9a99/evaluation/calculate_score.py#L30-L36
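
A rough sketch of the early-return behavior suggested above. The helper names `coerce_to_text` and `normalize_or_none` are hypothetical and do not come from the repository. Treating a `None` input and a whitespace-only string as "no answer" are additional assumptions on my part.

```python
from typing import Any, Optional


def coerce_to_text(extraction: Any) -> Optional[str]:
    """Return the extraction as a string, or None if it cannot be converted."""
    if extraction is None:
        # Assumption: a missing extraction means "no answer".
        return None
    try:
        return str(extraction)
    except Exception:
        # Exit early instead of substituting an empty string.
        return None


def normalize_or_none(extraction: Any) -> Optional[str]:
    """Strip the extraction, returning None when there is nothing usable."""
    text = coerce_to_text(extraction)
    if text is None or not text.strip():
        return None
    return text.strip()
```

Returning `None` here keeps a failed conversion from reaching the similarity matching step as an empty string, which is the path that could otherwise still produce a valid-looking choice.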

Video Demonstration

https://youtu.be/vj07WRvcLDw

Feb 14 '24 23:02 mattmazzola
