How to train and evaluate policy models with the unified dataset format?
Hi there, I noticed that there are APIs to load NLU, DST, Policy and NLG data in the unified data format. I also found the training and evaluation guides for NLU/DST/NLG with unified data in $model/README.md or NLU/DST/NLG/evaluate_unified_datasets.py. However, I did not find a guide on how to train and evaluate policy models with the unified data format. Specifically, I have the following questions:
- Training: I did not find support for training with the unified data format in $policy_model/train.py (e.g. ppo/train.py and mle/train.py); it seems that they use MultiWozEvaluator by default.
- Evaluation: I did not find support for evaluation with the unified data format in policy/evaluate.py; it seems that it also uses MultiWozEvaluator by default.
- My Training Experiment: I tried to train a PPO policy with the config file base_pipeline_rule_user.json (initialized with MLE policy weights trained with the default config) and got this result: Best Complete Rate: 0.95, Best Success Rate: 0.5, Best Average Return: 4.5. It is a good start for me, but still worse than the BERTNLU | RuleDST | PPOPolicy | TemplateNLG evaluation in the ConvLab-2 README (75.5 completion rate and 71.7 success rate). Where does this gap come from?
- My Evaluation Experiment: I evaluated my previously trained PPO model with policy/evaluate.py, but got a much worse result: "Complete 500 0.372 Success 500 0.228 Success strict 500 0.174". During the evaluation there were two warnings: "Value not found in standard value set: [dontcare] (slot: name domain: restaurant)" and "Value [none] invalid! (Lexicalisation Error) (slot: name domain: hotel)". They seem to point to a dataset format mismatch between the training and evaluation processes, because I am not sure whether I used the original MultiWOZ format or the unified data format to train and evaluate my policy model.
- For user simulator: I found that tus, emoUS and genTUS can be trained and evaluated with the unified data format. However, I did not find unified data format support in the rule-based user simulator. Does that mean that if I train my models (NLU/NLG or policy) with the unified data format, I cannot evaluate them with the rule-based user simulator?
Looking forward to your reply, James Cao
Another question: I found that my trained PPO policy outputs tens of system acts in every turn. Is that expected?
@ChrisGeishauser could you give some guidance?
Hi @JamesCao2048, thanks a lot for all your questions! I hope I can answer them sufficiently for you:
- For MLE training, this is explained in the README: https://github.com/ConvLab/ConvLab-3/tree/master/convlab/policy/mle. When you execute train.py, you just pass --dataset_name=sgd and it should work. For the DDPT model (in the folder vtrace_DPT), the README also explains how to specify the dataset: in the pipeline configuration, under "vectorizer_sys", you set "dataset_name" = "sgd" (see the config sketch after this list). For PPO it should be exactly the same as for DDPT (even though I have not checked it yet). But as you found out, there is at the moment unfortunately only an evaluator for MultiWOZ, so currently RL training is only possible on MultiWOZ. We are working on an SGD evaluator and hope to finish it soon.
- You are right, there is unfortunately only a MultiWOZ evaluator at the moment, but we are working on an SGD evaluator.
- If the policy is loaded correctly, there should be an output in the terminal at the beginning like "dialogue policy loaded from checkpoint ...". If you do not see that, it is not loaded correctly. You have to be a bit careful here: you should not set the "load_path" to "save/best_ppo.pol.mdl" but to "save/best_ppo", because the policy tries to load both the policy and the critic from that prefix (see the loading sketch after this list). Sorry for the confusion! Please check whether the model is loaded correctly and otherwise contact me again. That hopefully closes the gap.
- This is definitely the performance of a randomly initialised policy. Please check if the policy is loaded correctly (see point 3 above)
- This is correct; unfortunately, the rule-based simulator only supports MultiWOZ at the moment.
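To make the DDPT/PPO configuration point concrete, here is a minimal sketch of the relevant config fragment. Only "vectorizer_sys" and "dataset_name" come from the answer above; the entry name, class path and the other parameter keys are assumptions about the exact layout, so check your own pipeline config (e.g. base_pipeline_rule_user.json) for the real names.

```python
import json

# Sketch only: "dataset_name" under "vectorizer_sys" is the setting described
# above; the entry name, "class_path" and the extra "ini_params" key are
# assumptions about the config layout, not copied from the repo.
config_fragment = {
    "vectorizer_sys": {
        "sys_vectorizer": {                                                   # assumed entry name
            "class_path": "convlab.policy.vector.vector_nodes.VectorNodes",  # assumed class path
            "ini_params": {"dataset_name": "sgd", "use_masking": True}        # "dataset_name" is the key setting
        }
    }
}
print(json.dumps(config_fragment, indent=2))

# For MLE, the dataset is selected on the command line instead, e.g.
#   python train.py --dataset_name=sgd    (run inside convlab/policy/mle)
```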
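And for the load_path convention, a small loading sketch of the checkpoint-prefix idea; the import path, constructor arguments and exact file suffixes are assumptions based on the description above, not verified against the code:

```python
# Sketch of the checkpoint-prefix convention: pass the prefix, not a single file,
# because the policy loads both the actor and the critic weights from it.
from convlab.policy.ppo import PPO   # assumed import path

policy = PPO(is_train=False)         # assumed constructor arguments

# Wrong: points at one file, so the critic checkpoint cannot be resolved.
# policy.load("save/best_ppo.pol.mdl")

# Right: a prefix; the suffixes (e.g. ".pol.mdl" and a critic suffix) are
# appended internally when loading.
policy.load("save/best_ppo")
```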
> Another question: I found that my trained PPO policy outputs tens of system acts in every turn. Is that expected?
This is an indicator that you used a random policy, in which case the output is expected: the policy architecture has an output dimension equal to the number of "atomic actions" (e.g. hotel-inform-phone or restaurant-request-price). For every atomic action there is a binary decision whether to use it or not. With a random policy, each atomic action is chosen with a probability of roughly 50%, which leads to a lot of actions.
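To see why this produces tens of actions, here is an illustrative numpy sketch (not ConvLab code) of such a multi-binary action head; the number of atomic actions is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_atomic_actions = 200                  # illustrative only; the real count depends on the ontology
logits = np.zeros(n_atomic_actions)     # an untrained/random policy has no preference per action
probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid(0) = 0.5 for every atomic action
chosen = rng.random(n_atomic_actions) < probs

print(int(chosen.sum()))                # on average ~100 actions per turn instead of a handful
```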
I hope I could help you with the answers! Let me know if something is unclear.
@ChrisGeishauser sorry for bothering you, do you have any estimate of when the evaluator class will be ready?
Another thing: the vectorizers seem to work only on the MultiWOZ dataset.