
Data Preprocessing

Cartus opened this issue 6 years ago · 5 comments

Hi, thanks for the great work!

I tried to run the code, but I don't know how to do the data preprocessing for the AMR corpus. May I ask how it should be done?

Cartus avatar Sep 08 '19 07:09 Cartus

Our baseline input can be the same linearized AMR format as Konstas et al. Only the concept nodes are retained as input to the Transformer model:

-train_src # concept node sequence
-train_structure1 # first token of the Xi-to-Xj path
-train_structure2 # second token of the Xi-to-Xj path
........
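To make the file layout above concrete, here is a minimal sketch of how the structural sequences could be derived: for each ordered pair of concept nodes (Xi, Xj), record the edge labels along a shortest path between them in the AMR graph. The graph, node names, and the `shortest_path_labels` helper are all illustrative, not part of the released code.

```python
from collections import deque

def shortest_path_labels(graph, src, dst):
    """BFS over an edge-labeled graph; return the edge labels along one
    shortest path from src to dst (hypothetical helper, for illustration)."""
    if src == dst:
        return ["self"]
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, labels = queue.popleft()
        for nxt, lab in graph.get(node, []):
            if nxt == dst:
                return labels + [lab]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, labels + [lab]))
    return ["None"]  # no path between the two concepts

# Toy AMR fragment: (want :ARG0 boy :ARG1 go), with reverse edges added
graph = {
    "want": [("boy", ":ARG0"), ("go", ":ARG1")],
    "boy":  [("want", ":ARG0-of")],
    "go":   [("want", ":ARG1-of")],
}
concepts = ["want", "boy", "go"]  # would be one -train_src line

# One structure entry per pair (Xi, Xj): labels on a shortest path
for xi in concepts:
    print([" ".join(shortest_path_labels(graph, xi, xj)) for xj in concepts])
```

In the released data files each path would then be split across `-train_structure1`, `-train_structure2`, ... by token position, as described above.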

Amazing-J avatar Sep 09 '19 01:09 Amazing-J

Hi @Amazing-J ,

Thank you for your prompt reply!

For the concept node sequence, I can use NeuralAmr https://github.com/sinantie/NeuralAmr to get the linearized sequence.

I also have two questions. The first is how to construct the structural sequences. The second: since the model requires sub-word units produced by BPE, how should the concept node sequence be generated under this setting?

Cartus avatar Sep 09 '19 03:09 Cartus

Hi @Amazing-J,

Thank you for releasing the code! As @Cartus pointed out, can you provide the code for running BPE over the source side, i.e., the linearized AMRs?

Best!

dungtn avatar Sep 23 '19 20:09 dungtn

Assuming that I've done the right thing for BPE by running

subword-nmt learn-bpe -s 10000 < ...LDC2015E86/training_source > codes.bpe
subword-nmt apply-bpe -c codes.bpe < ...LDC2015E86/dev_source > dev_source_bpe

then I still got this error:

FileNotFoundError: [Errno 2] No such file or directory: ...LDC2015E86/data_vocab.pt

How can I generate this file?

dungtn avatar Sep 24 '19 03:09 dungtn
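As a sanity check on the BPE step above, here is a minimal sketch of what `apply-bpe` does to a single word: it greedily applies the learned merge operations in order. The merge table below is hypothetical; the real `subword-nmt` tool also adds `@@` continuation markers and processes whole lines, which this sketch omits.

```python
def apply_bpe(word, merges):
    """Apply learned BPE merges to one word, in learned order
    (simplified sketch of subword-nmt's apply-bpe, for illustration)."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols

# Hypothetical merge table, as learn-bpe might produce from a tiny corpus
merges = [("l", "o"), ("lo", "w")]
print(apply_bpe("lower", merges))  # → ['low', 'e', 'r']
```

The same codes file learned from the training source must be applied to train, dev, and test splits so the sub-word vocabulary stays consistent.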

Alright, I found out that I also have to run preprocess.sh. Thanks!

dungtn avatar Sep 24 '19 04:09 dungtn