MultiWOZ_Evaluation

Add MultiWOZ 2.4 DST evaluation with leave-one-out cross-validation support

Open WeixuanZ opened this issue 2 years ago • 12 comments

We aim to conduct DST evaluation on the MultiWOZ 2.4 corpus. This PR shows our proposed extension to the existing code to achieve this.

WeixuanZ avatar Feb 15 '23 14:02 WeixuanZ

Oops, meant to create a PR in my fork. I'll reopen this PR once it is reviewed in my fork.

WeixuanZ avatar Feb 15 '23 14:02 WeixuanZ

@smartyfh I would greatly appreciate your comments on this!

WeixuanZ avatar Feb 26 '23 22:02 WeixuanZ

Thank you, Weixuan. That would be great. Please let me know what I can do.

smartyfh avatar Feb 27 '23 10:02 smartyfh

@smartyfh Thanks! It would be great if you could have a look at our dataset loading logic and let us know if there is anything that we may have missed or done differently. In particular, we would value your insights on whether slots with value none should be dropped (https://github.com/WeixuanZ/MultiWOZ_Evaluation/blob/55fb7c26a7b6ecc6d62b7a068c1f890bc9e3f2e4/mwzeval/utils.py#L214).

WeixuanZ avatar Feb 27 '23 16:02 WeixuanZ

I don't fully understand why the NONE value should be removed. When we evaluate the performance of DST, we should take all slots into account. It seems to be easier to keep all the slots and their values. If we remove the NONE values, we need to take care of post-processing when calculating evaluation metrics.

smartyfh avatar Feb 27 '23 18:02 smartyfh

Hi @smartyfh, thanks so much for engaging! To clarify, does the none value indicate that the user has not yet mentioned a slot value, or is it a special value that indicates slot "deletion"? Our reason to "remove" it is that the authors of D3ST (https://arxiv.org/pdf/2201.08904.pdf) ignored it during pre-processing, so we ought to do the same during post-processing. To make the evaluator implementation-agnostic, should we add a flag that states whether none should be removed or not? That way, future users who did not pre-process their data to remove none slot values would be able to evaluate their models fairly too.

alexcoca avatar Feb 28 '23 14:02 alexcoca

Hi @alexcoca, my pleasure. NONE is not a special value. When a slot is either not mentioned or its value has been deleted, the value is NONE. "Not Mentioned" is another value that is also used to indicate "not mentioned" slots, so it is safe to change "not mentioned" to "none". Regarding the last question, adding a flag sounds like a good option. Cheers!

smartyfh avatar Feb 28 '23 15:02 smartyfh
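
The flag discussed above could look roughly like the following. This is a minimal sketch, not the code in the PR: the names `normalize_state`, `joint_match`, and `drop_none` are hypothetical, and lowercasing values before comparison is an assumption. It maps "not mentioned" to "none" as suggested, and only drops such slots when the flag is set.

```python
from typing import Dict

# Values treated as "no value", per the discussion above.
# Matching them case-insensitively is an assumption of this sketch.
EMPTY_VALUES = {"none", "not mentioned"}


def normalize_state(state: Dict[str, str], drop_none: bool = True) -> Dict[str, str]:
    """Map "not mentioned" to "none"; drop such slots if drop_none is set.

    Hypothetical helper, not part of mwzeval.
    """
    normalized = {}
    for slot, value in state.items():
        value = value.strip().lower()
        if value in EMPTY_VALUES:
            if drop_none:
                continue  # slot is omitted from the compared state
            value = "none"
        normalized[slot] = value
    return normalized


def joint_match(reference: Dict[str, str], hypothesis: Dict[str, str],
                drop_none: bool = True) -> bool:
    """Turn-level exact match after applying the same normalization to both."""
    return normalize_state(reference, drop_none) == normalize_state(hypothesis, drop_none)


reference = {"hotel-area": "north", "hotel-parking": "not mentioned"}
hypothesis = {"hotel-area": "north"}  # model trained on data without none slots

match_when_dropping = joint_match(reference, hypothesis, drop_none=True)
match_when_keeping = joint_match(reference, hypothesis, drop_none=False)
```

With `drop_none=True` the two states above compare equal; with `drop_none=False` they do not, because the reference keeps `hotel-parking` as `none` while the hypothesis never predicts it. This is why the flag would need to match how the model's training data was pre-processed.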

Thanks so much @smartyfh!

WeixuanZ avatar Mar 03 '23 17:03 WeixuanZ

Hi all!

I am interested in using the MultiWOZ 2.4 evaluator implemented here. For my understanding: is the implementation in this PR complete and correct, with any remaining changes being API and documentation improvements? If so, I may try to use it, and could possibly even help resolve the remaining issues if the PR is no longer active.

Thanks all for building such a useful tool! The discussion here has also been helpful for my understanding of the evaluation process.

kingb12 avatar Oct 02 '23 16:10 kingb12

@kingb12 This sounds like a great idea. Let's have a fresh look at the work and maybe ask one or two additional MultiWOZ experts to validate it, to be extra sure the evaluator is correct. @Tomiinek, apart from yourself, who do you think would be suited to sign off on this evaluation PR?

alexcoca avatar Oct 03 '23 09:10 alexcoca

Hey guys,

I think I do not have enough capacity to meaningfully and thoroughly review (it has been a long time since I did something related to dialogs). If you guys are going to test it and finish the remaining bits, I would be more than happy to merge it ... just ping me.

Maybe @tuetschek, @vojtsek or @oplatek could chime in.

Tomiinek avatar Oct 06 '23 13:10 Tomiinek

Thanks all! I can work on the remaining feedback from the open PR, testing, etc, unless someone else would prefer to. I'm also working on a few other things so it may take me a week or so, but I wanted to gauge whether this would be helpful. Appreciate the response and comments!

kingb12 avatar Oct 06 '23 15:10 kingb12