ConvLab-2
evaluate vs test
Hi, the difference between the evaluate.py and test.py scripts in the NLU folder is not clear to me. What exactly do they evaluate? The results they produce are completely different even though they take the same test set as input.
evaluate.py uses the unified interface inherited from NLU, such as class BERTNLU(NLU). Each NLU model should provide such a class so that we can compare different models given the same inputs. test.py is the test script for BERTNLU only, which may have different preprocessing. However, the difference should not be large.
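For context, here is a minimal sketch of how that unified interface is typically consumed (the constructor arguments and the example output below are illustrative assumptions, not taken from the repo):

```python
# Sketch of using the unified NLU interface; BERTNLU implements NLU.predict().
from convlab2.nlu.jointBERT.multiwoz import BERTNLU

nlu = BERTNLU()  # assumed to load a pretrained joint BERT model for MultiWOZ

# predict() maps a raw utterance to dialog acts in [intent, domain, slot, value] form
dialog_acts = nlu.predict("I am looking for a cheap restaurant in the centre")
print(dialog_acts)
# e.g. [['Inform', 'Restaurant', 'Price', 'cheap'], ['Inform', 'Restaurant', 'Area', 'centre']]
```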
OK, so evaluate.py is used to compare the performance of different NLU models, while if I want to test only BERTNLU I should use test.py? It is not clear to me why test.py calls the functions is_slot_da, calculate_F1 and recover_intent while evaluate.py does not. On what basis is the overall performance computed in evaluate.py, if neither slots nor intents are recovered? Thanks
Yes, evaluate.py is used to compare the performance of different NLU models. It will be slower than test.py since it uses batch_size=1. If you only want to test BERTNLU (e.g., to tune some hyper-parameters) and do not need to compare with other NLU models, you can use test.py for verification. The difference should not be large.
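In other words, an evaluate.py-style run boils down to a per-utterance loop over the test set, which is why batch_size=1 makes it slower. A rough sketch, reusing the nlu object from the snippet above and a hypothetical test_data structure:

```python
# Illustrative loop only; the field names of test_data are hypothetical.
golden_das, predict_das = [], []
for sample in test_data:                      # one utterance at a time -> batch_size = 1
    gold = sample['dialog_act']               # gold dialog acts for this turn
    pred = nlu.predict(sample['utterance'])   # single forward pass per utterance
    golden_das.append(gold)
    predict_das.append(pred)
# golden_das / predict_das are then scored with a dialog-act F1 (see the sketch further below)
```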
evaluate.py will call recover_intent in BERTNLU: https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/nlu/jointBERT/multiwoz/nlu.py#L106. And calculate_F1 will be called by both evaluate.py and test.py: https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/nlu/jointBERT/multiwoz/postprocess.py#L13
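To make the role of recover_intent more concrete, here is a hypothetical, heavily simplified decoder that turns BIO slot tags back into dialog-act tuples; the tag format shown is an assumption, and the real recover_intent at the link above works on the model's raw outputs and is more involved.

```python
def recover_slot_das(words, tags):
    """Hypothetical helper: turn BIO tags such as 'B-Restaurant-Inform+Food'
    back into [intent, domain, slot, value] dialog acts (tag format assumed)."""
    spans, cur_tag, cur_value = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):
            if cur_tag is not None:
                spans.append((cur_tag, ' '.join(cur_value)))
            cur_tag, cur_value = tag[2:], [word]
        elif tag.startswith('I-') and cur_tag == tag[2:]:
            cur_value.append(word)
        else:
            if cur_tag is not None:
                spans.append((cur_tag, ' '.join(cur_value)))
            cur_tag, cur_value = None, []
    if cur_tag is not None:
        spans.append((cur_tag, ' '.join(cur_value)))
    das = []
    for tag, value in spans:           # 'Domain-Intent+Slot' -> dialog-act tuple
        domain_intent, slot = tag.split('+')
        domain, intent = domain_intent.split('-')
        das.append([intent, domain, slot, value])
    return das

words = "i want cheap chinese food".split()
tags = ['O', 'O', 'B-Restaurant-Inform+Price', 'B-Restaurant-Inform+Food', 'O']
print(recover_slot_das(words, tags))
# [['Inform', 'Restaurant', 'Price', 'cheap'], ['Inform', 'Restaurant', 'Food', 'chinese']]
```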
is_slot_da decides whether a dialog act (intent, domain, slot, value) is non-categorical, which means the value appears in the sentence and we use the slot-tagging method in BERTNLU to extract it (e.g., informing the name of a restaurant). If is_slot_da is False, we use [CLS] to do binary classification to judge whether such a dialog act exists (e.g., requesting the name of a restaurant). We evaluate the two kinds of dialog acts and report slot F1 and intent F1 respectively. However, these metrics may not apply to other NLU models such as generative models, so they are not included in evaluate.py.
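As a rough illustration of that split (this is not the actual is_slot_da at the link above, only a sketch of the idea it encodes): acts whose values are realized as spans in the utterance go to the slot-tagging side and are scored with slot F1, the rest go to the [CLS] binary classifiers and are scored with intent F1.

```python
def split_dialog_acts(dialog_acts, utterance):
    """Illustrative split only: value realized in the text -> slot-tagging side,
    otherwise -> [CLS] binary-classification side."""
    slot_das, intent_das = [], []
    for intent, domain, slot, value in dialog_acts:
        if value and value.lower() in utterance.lower():  # e.g. inform the restaurant name
            slot_das.append([intent, domain, slot, value])
        else:                                             # e.g. request the restaurant name
            intent_das.append([intent, domain, slot, value])
    return slot_das, intent_das
```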
In evaluate.py we directly evaluate dialog act F1, comparing two lists of (intent, domain, slot, value) tuples.
Hi, thanks for your reply. Could you please specify whether the Recall, Precision and F1 scores in test.py are micro-averaged?
Yes, they are. TP, FP, and FN are accumulated over the whole test set.
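For reference, a micro-averaged dialog-act F1 in the spirit of calculate_F1 (the linked implementation may differ in detail) accumulates the counts over all turns before computing the scores:

```python
def micro_f1(predict_das, golden_das):
    """Micro-averaged precision/recall/F1 over lists of dialog-act tuples."""
    TP = FP = FN = 0
    for preds, golds in zip(predict_das, golden_das):
        preds = [tuple(da) for da in preds]   # (intent, domain, slot, value)
        golds = [tuple(da) for da in golds]
        TP += sum(1 for da in preds if da in golds)       # predicted and correct
        FP += sum(1 for da in preds if da not in golds)   # predicted but wrong
        FN += sum(1 for da in golds if da not in preds)   # missed gold acts
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```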