Is it more appropriate to evaluate the test results using SPICE?

Open eduOS opened this issue 7 years ago • 0 comments

I found this paper, which claims that "Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE".
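
To make the difference concrete, here is a toy sketch, not the real SPICE implementation. It compares a bigram-overlap precision with an F-score over semantic tuples for two captions that mean the same thing. The tuples are written by hand as a stand-in for the scene-graph parsing step, and the function names (`ngram_precision`, `tuple_f1`) are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Fraction of candidate n-grams that also occur in the reference (clipped counts)."""
    def ngrams(text: str) -> Counter:
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    return sum((cand & ref).values()) / total


def tuple_f1(candidate_tuples, reference_tuples) -> float:
    """F1 over exact-match semantic tuples (objects, attributes, relations)."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "a man is riding a horse"
candidate = "a horse is ridden by a man"

# Hand-written tuples (an assumption): both captions express the same propositions.
reference_tuples = {("man",), ("horse",), ("man", "ride", "horse")}
candidate_tuples = {("man",), ("horse",), ("man", "ride", "horse")}

bigram_score = ngram_precision(candidate, reference)        # only 2 of 6 bigrams match
semantic_score = tuple_f1(candidate_tuples, reference_tuples)  # all tuples match
print(f"bigram precision: {bigram_score:.3f}, tuple F1: {semantic_score:.3f}")
```

The two captions share few bigrams even though they describe the same scene, while the tuple-based score treats them as equivalent, which is the behaviour the quoted abstract argues for.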

Jan 03 '18 09:01 eduOS