scpn How should I preprocess the data?

If I just want to train the SCPN model, I only need to preprocess the para-nmt dataset. But what if I want to use SCPN to generate syntactically adversarial examples for a downstream task? Should I preprocess (for example, tokenize and apply BPE to) the para-nmt dataset together with the downstream task's dataset? How did you preprocess the SST and SICK data? @miyyer @jwieting Thank you very much!

Jun 16 '18 12:06 shuangqinbuaa

Did you ever figure this out? It looks like they use a regular parse tree. But obviously it would be best to parse using the same process they did.

I'm asking what the expected method is for parsing the input sentences before paraphrasing, i.e. how to get output like this:

a person in a black jacket is doing tricks on a motorbike
(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN in) (NP (DT a) (JJ black) (NN jacket)))) (VP (VBZ is) (VP (VBG doing) (NP (NNS tricks)) (PP (IN on) (NP (DT a) (NN motorbike))))) (. .)))

Aug 28 '18 11:08 Henry-E

Also, I'm curious how to create templates for generation. The demo script has 10 default templates, but it would be useful to understand how they were created in order to create new ones.

Aug 28 '18 11:08 Henry-E

The Stanford NLP constituency parser seems to work well, though I am still curious about how to use different templates.
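For anyone else trying this, here is a minimal sketch of getting one-line bracketed parses from a locally running Stanford CoreNLP server over HTTP. It is not the authors' pipeline. The server address (`http://localhost:9000`), the annotator list, and the helper names `build_request`, `parses_from_response`, and `parse_text` are my own assumptions for illustration; the response layout (`sentences[i]["parse"]`) is what the CoreNLP server returns with `outputFormat` set to `json`.

```python
import json
import urllib.parse
import urllib.request


def build_request(text, server="http://localhost:9000"):
    """Build a POST request asking CoreNLP for constituency parses as JSON."""
    # Assumed annotator list; a different tokenizer or parser model may be
    # needed to match the parses SCPN was trained on.
    props = {"annotators": "tokenize,ssplit,pos,parse", "outputFormat": "json"}
    url = server + "/?properties=" + urllib.parse.quote(json.dumps(props))
    # Supplying `data` makes urllib send a POST with the raw text as the body.
    return urllib.request.Request(url, data=text.encode("utf-8"))


def parses_from_response(response):
    """Collapse each sentence's multi-line parse into a single-line string."""
    return [" ".join(sent["parse"].split()) for sent in response["sentences"]]


def parse_text(text, server="http://localhost:9000"):
    """Send text to a running CoreNLP server and return one parse per sentence."""
    with urllib.request.urlopen(build_request(text, server)) as reply:
        return parses_from_response(json.load(reply))
```

With a server running, `parse_text("a person in a black jacket is doing tricks on a motorbike")` should return a list containing one bracketed parse string for the sentence.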

Aug 28 '18 12:08 Henry-E

sorry for the enormously delayed response! we have added some functions to run on top of the corenlp output to make it easier to get your data into the right format (see extract_parses in read_paranmt_parses.py). @jwieting will soon add a file containing all of the templates in ParaNMT sorted by frequency so you can play around with more of them (in our paper, we use the top 20 most frequently-occurring templates).
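Until that template file is available, a rough sketch of how templates could be derived and counted from bracketed parses is below. It assumes a template is the parse truncated to the top two levels below ROOT with words removed, which is how the SCPN paper describes templates; the exact string format expected by the demo script may differ. The function names `parse_tree`, `template`, and `top_templates` are hypothetical and are not part of the repository.

```python
from collections import Counter


def parse_tree(s):
    """Read a bracketed constituency parse into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                pos += 1  # skip the word at a leaf; only labels are kept
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def template(tree, depth=2):
    """Render labels only, keeping `depth` levels below the given node."""
    label, children = tree
    if depth == 0 or not children:
        return f"({label})"
    inner = " ".join(template(child, depth - 1) for child in children)
    return f"({label} {inner})"


def top_templates(parses, k=20):
    """Count templates over an iterable of parse strings, most frequent first."""
    counts = Counter(template(parse_tree(p)) for p in parses)
    return counts.most_common(k)
```

Under these assumptions, the example parse earlier in this thread reduces to `(ROOT (S (NP) (VP) (.)))`.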

Sep 05 '18 22:09 miyyer

Hi, just a friendly reminder, any update on the templates?

Nov 27 '18 04:11 kj-lai

Hi @miyyer @jwieting, just a friendly reminder, could you kindly share how the paranmt dataset is preprocessed (tokenizing, BPE, etc.)? Thanks

Jan 18 '19 23:01 zhengliz

> Hi @miyyer @jwieting, just a friendly reminder, could you kindly share how the paranmt dataset is preprocessed (tokenizing, BPE, etc.)? Thanks

I also want to know the BPE and tokenizing part.

Apr 01 '20 07:04 LeeShiyang

I also want to know about the templates!

Apr 22 '20 13:04 santimarro