
Preprocessing

QAQ-v opened this issue 5 years ago · 13 comments

Hi,

Could you please release the preprocessing code for generating the structural sequences and the commands for applying BPE? That is, how do we get the files in corpus_sample/all_path_corpus and corpus_sample/five_path_corpus?

Thanks.

QAQ-v · Sep 18 '19

Python has an [anytree](https://pypi.org/project/anytree/2.1.4/) package. You can try it.

Amazing-J · Sep 19 '19
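A minimal sketch of the kind of path extraction anytree supports; this is not the authors' preprocessing code, and the toy tree, the concept names, and the choice of emitting node labels along the path are illustrative assumptions only:

    from anytree import Node, Walker

    # Toy AMR-like tree (concept names are made up for illustration).
    want = Node("want-01")
    boy = Node("boy", parent=want)
    go = Node("go-01", parent=want)
    boy2 = Node("boy", parent=go)

    # Walker.walk(start, end) returns (nodes walked upwards from start,
    # the common ancestor, nodes walked downwards to end).
    up, top, down = Walker().walk(boy, boy2)

    # One possible "structural sequence": the labels along the path.
    path = [boy.name] + [n.name for n in up] + [top.name] + [n.name for n in down]
    print(path)  # ['boy', 'want-01', 'go-01', 'boy']

How the released corpus_sample/all_path_corpus and corpus_sample/five_path_corpus files encode such paths is exactly what the question above asks about, so treat this only as a starting point.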

> Python has an [anytree](https://pypi.org/project/anytree/2.1.4/) package. You can try it.

Thanks for your reply! I am still confused about how to get the structural sequences; maybe releasing the preprocessing code or the preprocessed data would be a better way to help people run your model.

Meanwhile, I have another question. I trained the Transformer baseline implemented in OpenNMT with the same hyperparameter settings as yours on LDC2015E86. When I compute the BLEU score on the BPE-segmented predictions I get a result comparable to Table 3 of your paper (25.5), but after I remove the "@@" from the predictions the BLEU drops a lot. So I am wondering: were the BLEU results you reported in Table 3 computed on the BPE-segmented predictions? Did you remove the "@@" from the final predictions of the model?

QAQ-v · Sep 26 '19

After deleting "@@ ", the BLEU score should not drop; it should rise a lot. Are you sure you are applying BPE correctly? Note that not only "@@" but also the following space has to be deleted ("@@ "). The target side should need nothing but tokenization (use the PTB tokenizer).

Amazing-J · Sep 26 '19
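As a concrete example of what deleting "@@ " (the marker plus the following space) does, using the sed command quoted later in this thread on an arbitrary made-up line:

    echo "im@@ prove the gener@@ ation" | sed -r 's/(@@ )|(@@ ?$)//g'
    # improve the generation

Deleting only "@@" without the space would instead yield "im prove the gener ation", which is why BLEU can drop if the space is left in.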

> After deleting "@@ ", the BLEU score should not drop; it should rise a lot. Are you sure you are applying BPE correctly? Note that not only "@@" but also the following space has to be deleted ("@@ "). The target side should need nothing but tokenization (use the PTB tokenizer).

Thanks for your reply!

I followed the author's instructions and deleted "@@ " (sed -r 's/(@@ )|(@@ ?$)//g'), so there shouldn't be any mistake there. So you mean you only apply BPE on the source side and not on the target side? But then the source and target sides do not share the same subword vocabulary; do you still share the vocab in the model? Could you please release the code for the BPE step? That might be more efficient and clearer.

QAQ-v · Sep 26 '19

What I mean is that both the source and the target side need BPE during training, while the target side does not need BPE during testing.
BPE is a commonly used method in machine translation; there is no special code for it.

Amazing-J · Sep 26 '19

> What I mean is that both the source and the target side need BPE during training, while the target side does not need BPE during testing. BPE is a commonly used method in machine translation; there is no special code for it.

Thanks for your patient reply!

I am still a little confused. So you only apply BPE on the training set and do not apply BPE on the test set, is that right? Or do you apply BPE on the source side of the test set but not on the target side?

QAQ-v · Sep 26 '19

Yes. During testing, only the source side needs BPE, and BLEU is then computed after deleting "@@ ".

Amazing-J · Sep 26 '19
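Putting the test-time procedure together, a sketch with hypothetical file names (test.source, pred.bpe, pred.txt) and a codes file learned on the training data; the translation step itself depends on your OpenNMT setup and is only indicated by a comment:

    # Apply BPE to the test-set source side only.
    subword-nmt apply-bpe -c codes.bpe < test.source > test.source.bpe

    # ... translate test.source.bpe with the trained model to obtain pred.bpe ...

    # Remove the BPE markers ("@@ " including the space) before scoring BLEU
    # against the plain tokenized references.
    sed -r 's/(@@ )|(@@ ?$)//g' < pred.bpe > pred.txt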

> Yes. During testing, only the source side needs BPE, and BLEU is then computed after deleting "@@ ".

Got it :). I will give it a try, thanks!

QAQ-v · Sep 26 '19

> Yes. During testing, only the source side needs BPE, and BLEU is then computed after deleting "@@ ".

Sorry to bother you again: what is {num_operations} set to in the following command? The default value of 10000?

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}

QAQ-v · Sep 26 '19

On LDC2015E86: 10000. On LDC2017T10: 20000.

train_file: cat train_source train_target (the concatenation of the source and target training files)

Amazing-J · Sep 26 '19

> train_file: cat train_source train_target

So you follow the instructions in "Best practice advice for Byte Pair Encoding in NMT", right?

If so, do you still keep --vocabulary-threshold at 50?

QAQ-v · Sep 26 '19

You only need to use these two commands:

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}

subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}


Amazing-J · Sep 26 '19
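For completeness, the full BPE setup described in this thread, with hypothetical file names (train.source, train.target, train.all, codes.bpe); the number of merge operations is 10000 on LDC2015E86 and 20000 on LDC2017T10:

    # Learn a single BPE model on the concatenation of source and target training data.
    cat train.source train.target > train.all
    subword-nmt learn-bpe -s 10000 < train.all > codes.bpe

    # Apply it to both sides of the training data (at test time, only to the source side).
    subword-nmt apply-bpe -c codes.bpe < train.source > train.source.bpe
    subword-nmt apply-bpe -c codes.bpe < train.target > train.target.bpe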

@Amazing-J Hi! I have the same question about generating the structural sequences. Can you provide more insight into how to use [anytree](https://pypi.org/project/anytree/2.1.4/) to get corpus_sample/all_path_corpus and corpus_sample/five_path_corpus? Any example preprocessing code would be much appreciated!

Bobby-Hua · May 30 '22