multi-value slot values
Thanks for releasing code!
I'm trying to understand how this implementation handles the multi-value slot values introduced in MultiWOZ 2.1 (e.g. cheap|moderate). It appears that this implementation considers a slot to be correctly predicted if at least one of the gold slot values is predicted (looking at the code [here](https://github.com/Yushi-Hu/IC-DST/blob/main/evaluate_metrics.py#L58)). I did not see similar handling of multi-values in the code releases for prior work (e.g. https://github.com/jshin49/ds2 or https://github.com/chiahsuan156/DST-as-Prompting or https://github.com/facebookresearch/Zero-Shot-DST/tree/main/TransferQA). Can you please comment on that?
Hi Dzmitry,
Thanks for your interest in our work! Yes, you are correct that in this implementation, a slot is considered to be correctly predicted if at least one of the gold slot values is predicted. I followed the evaluation pipeline from a really popular prior work, TODBERT (the evaluation is implemented in its "evaluate" function). The prior work on MultiWOZ 2.4, ASSIST-DST, also follows this.
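To make the "at least one of the gold values" rule concrete, here is a minimal sketch. It is not the code from evaluate_metrics.py or TODBERT; the function name `slot_matches_one_of` is hypothetical, and it assumes gold alternatives are separated by "|" as in the cheap|moderate example.

```python
def slot_matches_one_of(gold_value: str, predicted_value: str) -> bool:
    """Lenient match: the prediction is correct if it equals any
    '|'-separated alternative in the gold value (illustrative helper)."""
    gold_alternatives = [v.strip() for v in gold_value.split("|")]
    return predicted_value.strip() in gold_alternatives


# A single predicted value is accepted if it is one of the gold alternatives.
print(slot_matches_one_of("cheap|moderate", "moderate"))
print(slot_matches_one_of("cheap|moderate", "expensive"))
```

Under this rule a model that only ever predicts one value per slot can still be scored as correct on a multi-value gold label.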
For MultiWOZ 2.1 and 2.2 this does not make much difference, because the multi-value slots are not annotated well and in most cases only one value is present. I think that's why most prior works just ignore this issue. For MultiWOZ 2.4 it makes a bigger difference, because the annotators found that many slots actually have multiple values. Currently, DST work assumes that each slot corresponds to only one value. I totally agree that we should rethink this assumption carefully.
Thanks Yushi for your fast response!
Yes, indeed, the SimpleTOD evaluation code compares multi-values in the same way as yours. Do you think some other implementations (including ASSIST-DST and the links I posted above) are effectively stricter and require the entire multi-value literal to be predicted correctly?
Thanks for the explanation about 2.1 and 2.4, I will take a look at the exact percentage of multi-values in different MultiWOZ versions.
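A rough way to measure that percentage could look like the sketch below. It assumes the gold labels have already been loaded into a list of per-turn dictionaries mapping slot names to value strings; that layout and the name `multi_value_fraction` are assumptions for illustration, not the MultiWOZ file format.

```python
from typing import Dict, Iterable


def multi_value_fraction(states: Iterable[Dict[str, str]]) -> float:
    """Fraction of slot labels whose gold value contains the '|' separator.
    Assumes each state maps slot names to value strings (hypothetical layout)."""
    total = 0
    multi = 0
    for state in states:
        for value in state.values():
            total += 1
            if "|" in value:
                multi += 1
    # Guard against an empty input so the function never divides by zero.
    return multi / total if total else 0.0


# Toy data for illustration only; these are not real MultiWOZ statistics.
toy_states = [
    {"hotel-pricerange": "cheap|moderate", "hotel-area": "north"},
    {"hotel-area": "north", "train-day": "monday"},
]
print(multi_value_fraction(toy_states))
```

Running the same count over each MultiWOZ version would show how much the choice of matching rule can affect the reported metrics.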
As for what the right evaluation approach should be, that depends on the exact semantics of the "|" operator. My understanding is that if "|" is logical OR, then all values should be predicted correctly. But if somewhere in the dataset it is used to indicate alternative spellings, then the "one-of" evaluation approach would be more appropriate. As usual it all boils down to there being a consistent and well-documented annotation approach, something that MultiWOZ still seems to be lacking.
Thanks rizar for your insights!
As for the first question, I checked some implementations, and in most cases they don't handle the multi-label scenario carefully. Some implementations just use the first possible value as the gold answer. I agree with the way ASSIST-DST handles the problem: it normalizes the labels by sorting the possible values, which effectively gives a stricter evaluation.
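As an illustration of the sort-based normalization idea, here is a minimal sketch. It is not taken from the ASSIST-DST code; the helper names are hypothetical, and it assumes that both gold and predicted values use "|" to join multiple values.

```python
def normalize_multi_value(value: str) -> str:
    """Sort the '|'-separated parts so that value order does not matter."""
    parts = sorted(p.strip() for p in value.split("|") if p.strip())
    return "|".join(parts)


def slot_matches_strict(gold_value: str, predicted_value: str) -> bool:
    """Strict match: the full set of values must be predicted (illustrative)."""
    return normalize_multi_value(gold_value) == normalize_multi_value(predicted_value)


# Order is ignored, but a partial prediction no longer counts as correct.
print(slot_matches_strict("cheap|moderate", "moderate|cheap"))
print(slot_matches_strict("cheap|moderate", "cheap"))
```

Compared with the one-of rule, this rejects a prediction that covers only some of the gold values, which is why it gives a stricter score.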
For your second comment, I totally agree! It boils down to the need for a well-documented annotation approach.