pointer-generator
Is it more appropriate to evaluate the test results using SPICE?
I found this paper, which argues: "Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE".
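To make the n-gram point concrete, here is a rough toy sketch of my own. It is not SPICE and not this repo's evaluation code; the `ngram_f1` helper and the example sentences are hypothetical, chosen only to show that bigram overlap can be high for sentences with different meanings and zero for sentences with similar meanings.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f1(candidate, reference, n=2):
    """F1 over clipped n-gram matches; a simplified stand-in for overlap metrics."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Different meaning, but many shared bigrams ("standing on", "on top", ...).
different_meaning = ngram_f1(
    "a giraffe standing on top of a green field",
    "a young girl standing on top of a tennis court",
)

# Similar meaning, but no shared bigrams at all.
similar_meaning = ngram_f1(
    "the pan on the stove has chopped vegetables in it",
    "a shiny metal pot filled with some diced veggies",
)

print(different_meaning, similar_meaning)
```

In this toy case the pair with different meanings scores higher than the pair with similar meanings, which is the kind of failure the quoted passage describes.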