Sem-Dialogue
                        tokenize and bpe for src and tgt in DialogRG
Hello! I have two questions regarding the processing of src and tgt files in DialogRG.
- For src and tgt, did your work follow Zhu et al. 2019 in using the PTBTokenizer for tokenization and then applying BPE with the code from subword-nmt? I plan to run BPE with the commands below, but I'm not sure about the num_operations you used.
subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {dev_file} > {out_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}
- I noticed DialogRG/dataset_utils.py also includes a bert_tokenize(tokenizer, src) function. Does that mean we can also use a BERT tokenizer instead of BPE on the src and tgt files?
Thank you again for your reply to my previous questions!
Hi,
In our scripts, we set the BPE num_operations to 10000. We have tried the BERT tokenizer, but we did not observe significant improvements.
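For reference, a minimal sketch of that pipeline with num_operations=10000, based on the subword-nmt commands quoted above. The file names (train.src, train.tgt, codes.bpe, etc.) are placeholders, and learning a single set of codes on the concatenated src and tgt training text is an assumption, not something confirmed in this thread.
# Placeholder file names; joint codes over src+tgt is an assumption.
cat train.src train.tgt > train.srctgt
subword-nmt learn-bpe -s 10000 < train.srctgt > codes.bpe
for split in train dev test; do
    subword-nmt apply-bpe -c codes.bpe < $split.src > $split.bpe.src
    subword-nmt apply-bpe -c codes.bpe < $split.tgt > $split.bpe.tgt
done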
Thank you! Is it possible to share the BPE codes file you derived from the DailyDialog train_file? I might be missing something, but my BPE output is still very different from the files you provided, even with num_operations=10000.
Sorry for my late reply. I can't find the exact codes file now, but you can try this one. Also, please note that we do not apply BPE to the AMR concepts, following Zhu et al. 2019.
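In other words, only the sentence-side files would be segmented, while the AMR concept input is passed through untouched. A rough sketch, assuming the concepts live in a separate file named train.concept (a hypothetical name):
subword-nmt apply-bpe -c codes.bpe < train.src > train.bpe.src
subword-nmt apply-bpe -c codes.bpe < train.tgt > train.bpe.tgt
# AMR concepts are left unsegmented, so the concept file is used as-is.
cp train.concept train.bpe.concept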
Thank you! Sorry to bother you again, but it looks like I can't open the link behind "you can try this one". Any chance you could share it through another platform like Google Drive?
I have uploaded it to Google Drive here; please have a try.