
tokenize and bpe for src and tgt in DialogRG

Open Bobby-Hua opened this issue 3 years ago • 5 comments

Hello! I have two questions regarding the processing of src and tgt files in DialogRG.

  1. For src and tgt, did your work follow Zhu et al. (2019) in using the PTBTokenizer for tokenization and then applying BPE with the code from subword-nmt? I plan to run BPE with the commands below, but I'm not sure which num_operations you used.
subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {dev_file} > {out_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}
  2. I noticed that DialogRG/dataset_utils.py also includes a bert_tokenize(tokenizer, src) function. Does that mean we can also use bert_tokenizer instead of BPE on the src and tgt files?
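
For context, the learn/apply steps that the subword-nmt commands above perform can be sketched in miniature. This is a toy illustration of the BPE algorithm, not the library's actual implementation; the `learn_bpe`/`apply_bpe` names here are hypothetical stand-ins for the CLI subcommands:

```python
from collections import Counter

def learn_bpe(words, num_operations):
    """Learn a BPE merge list: repeatedly merge the most frequent symbol pair."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)  # word -> symbol sequence
    merges = []
    for _ in range(num_operations):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol already
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

The num_operations value (10000 in the commands above as a placeholder) bounds the number of merges learned, which is why different values produce different codes files and different segmentations.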

Thank you again for your reply to my previous questions!

Bobby-Hua avatar Apr 13 '22 01:04 Bobby-Hua

Hi,

In our scripts, we set the BPE num_operations to 10000. We have tried the BERT tokenizer, but we didn't observe significant improvements.

goodbai-nlp avatar Apr 13 '22 05:04 goodbai-nlp

Thank you! Is it possible to share the BPE codes file you derived from the DailyDialog train_file? I might be missing something, but my BPE output is still very different from the files you provided, even with num_operations=10000.

Bobby-Hua avatar Apr 16 '22 01:04 Bobby-Hua

Sorry for my late reply. I can't find the exact codes file now; you can try this one. Also, please note that we do not apply BPE to AMR concepts, following Zhu et al. (2019).
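
That rule (BPE on surface tokens only, AMR concepts left whole) can be sketched as a small preprocessing helper. This is a hypothetical illustration: `segment` stands in for whatever BPE function you use, and `is_concept` for however concept tokens are identified in your pipeline:

```python
def bpe_preprocess(tokens, is_concept, segment):
    """Apply a BPE segmenter only to ordinary tokens,
    leaving AMR concept tokens unsplit."""
    out = []
    for tok in tokens:
        if is_concept(tok):
            out.append(tok)           # keep the AMR concept whole
        else:
            out.extend(segment(tok))  # subword-split everything else
    return out
```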

goodbai-nlp avatar Apr 20 '22 14:04 goodbai-nlp

Thank you! Sorry to bother you again, but it looks like I can't open the link in "you can try this one". Any chance you could share it through another platform like Google Drive?

Bobby-Hua avatar Apr 22 '22 18:04 Bobby-Hua

I have uploaded it to Google Drive here; please give it a try.

goodbai-nlp avatar Apr 26 '22 06:04 goodbai-nlp