
tokenize and bpe for src and tgt in DialogRG

Open Bobby-Hua opened this issue 3 years ago • 5 comments

Hello! I have two questions regarding the processing of src and tgt files in DialogRG.

  1. For src and tgt, did your work follow Zhu et al. (2019) in using the PTBTokenizer for tokenization and then applying BPE with the code from subword-nmt? I plan to run BPE with the commands below, but I'm not sure which num_operations you used.
subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {dev_file} > {out_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}
  2. I noticed that DialogRG/dataset_utils.py also includes a bert_tokenize(tokenizer, src) function. Does that mean we can also use bert_tokenizer instead of BPE on the src and tgt files?
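
For context, the learn/apply steps that the subword-nmt commands above perform can be sketched in miniature. This is a toy illustration of the BPE algorithm, not the library's actual implementation; the `learn_bpe`/`apply_bpe` names here are hypothetical stand-ins for the CLI subcommands:

```python
from collections import Counter

def learn_bpe(words, num_operations):
    """Learn a BPE merge list: repeatedly merge the most frequent symbol pair."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)  # word -> symbol sequence
    merges = []
    for _ in range(num_operations):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is a single symbol already
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

The num_operations value (10000 in the commands above as a placeholder) bounds the number of merges learned, which is why different values produce different codes files and different segmentations.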

Thank you again for your reply to my previous questions!

Bobby-Hua avatar Apr 13 '22 01:04 Bobby-Hua

Hi,

In our scripts, we set the BPE num_operations to 10000. We have tried the BERT tokenizer, but we didn't observe significant improvements.

goodbai-nlp avatar Apr 13 '22 05:04 goodbai-nlp

Thank you! Is it possible to share the BPE codes file you derived from the DailyDialog train_file? I might be missing something, but my BPE output is still very different from the files you provided, even with num_operations=10000.

Bobby-Hua avatar Apr 16 '22 01:04 Bobby-Hua

Sorry for my late reply. I can't find the exact codes file now; you can try this one. Also, please note that we do not apply BPE to AMR concepts, following Zhu et al. (2019).
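
That rule (BPE on surface tokens only, AMR concepts left whole) can be sketched as a small preprocessing helper. This is a hypothetical illustration: `segment` stands in for whatever BPE function you use, and `is_concept` for however concept tokens are identified in your pipeline:

```python
def bpe_preprocess(tokens, is_concept, segment):
    """Apply a BPE segmenter only to ordinary tokens,
    leaving AMR concept tokens unsplit."""
    out = []
    for tok in tokens:
        if is_concept(tok):
            out.append(tok)           # keep the AMR concept whole
        else:
            out.extend(segment(tok))  # subword-split everything else
    return out
```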

goodbai-nlp avatar Apr 20 '22 14:04 goodbai-nlp

Thank you! Sorry to bother you again, but it looks like I can't open the link in "you can try this one". Any chance you could share it through another platform like Google Drive?

Bobby-Hua avatar Apr 22 '22 18:04 Bobby-Hua

I have uploaded it to Google Drive here; please give it a try.

goodbai-nlp avatar Apr 26 '22 06:04 goodbai-nlp