How should I preprocess the data?

Open shuangqinbuaa opened this issue 7 years ago • 8 comments

If I only want to train the SCPN model, I just need to preprocess the para-nmt dataset. But what if I want to use SCPN to generate syntactically adversarial examples for a downstream task? Should I preprocess the para-nmt dataset (for example, tokenizing and applying BPE) together with the downstream task's dataset? How did you preprocess the SST and SICK data? @miyyer @jwieting Thank you very much!

shuangqinbuaa avatar Jun 16 '18 12:06 shuangqinbuaa

Did you ever figure this out? It looks like they use a regular parse tree. But obviously it would be best to parse using the same process they did.

Specifically, I'm asking what the expected method is for parsing the input sentences for paraphrasing, i.e. how to get output like the parse below for a given sentence:

a person in a black jacket is doing tricks on a motorbike
(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN in) (NP (DT a) (JJ black) (NN jacket)))) (VP (VBZ is) (VP (VBG doing) (NP (NNS tricks)) (PP (IN on) (NP (DT a) (NN motorbike))))) (. .)))
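
For anyone checking parser output against this format, here is a minimal sketch of reading such a bracketed parse back into a tree and recovering its tokens. It assumes Penn-Treebank-style brackets as in the example above; `parse_bracketed` and `leaves` are hypothetical helper names for illustration, not functions from the SCPN repository.

```python
def parse_bracketed(s):
    """Parse a bracketed constituency string into nested (label, children) tuples.

    Leaves (words) are kept as plain strings. Assumes well-formed input in which
    literal parentheses inside the sentence are escaped (e.g. -LRB- / -RRB-).
    """
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]  # the constituent or POS label follows "("
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return read()


def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


example = ("(ROOT (S (NP (NP (DT A) (NN person)) (PP (IN in) (NP (DT a) (JJ black) "
           "(NN jacket)))) (VP (VBZ is) (VP (VBG doing) (NP (NNS tricks)) (PP (IN on) "
           "(NP (DT a) (NN motorbike))))) (. .)))")
tree = parse_bracketed(example)
print(tree[0])                 # root label of the parse
print(" ".join(leaves(tree)))  # tokens as the parser saw them
```

Comparing the recovered tokens with the input sentence is a quick way to see how the parser tokenized and cased the text.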

Henry-E avatar Aug 28 '18 11:08 Henry-E

Also I'm curious how to create templates for the generation aspect. They have 10 default templates in the demo script but it would be useful to understand how they created these in order to create new ones.

Henry-E avatar Aug 28 '18 11:08 Henry-E

The Stanford NLP constituency parser seems to work well, though I am still curious about how to use different templates.

Henry-E avatar Aug 28 '18 12:08 Henry-E

sorry for the enormously delayed response! we have added some functions to run on top of the corenlp output to make it easier to get your data into the right format (see extract_parses in read_paranmt_parses.py). @jwieting will soon add a file containing all of the templates in ParaNMT sorted by frequency so you can play around with more of them (in our paper, we use the top 20 most frequently-occurring templates).
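
Until that file is available, a rough sketch of how templates could be derived and ranked from a set of parses is given below. It assumes a template is the parse truncated to its top few levels with the words removed; the `depth=3` default (counting ROOT) is a guess and should be checked against the templates in the demo script. `template` and `top_templates` are hypothetical names and are not the repository's `extract_parses`.

```python
from collections import Counter


def template(parse, depth=3):
    """Keep only the labels of the top `depth` levels of a bracketed parse.

    Words and all deeper constituents are dropped. The depth is an assumption.
    """
    tokens = parse.replace("(", " ( ").replace(")", " ) ").split()
    out = []
    level = 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            level += 1
            if level <= depth:
                out += ["(", tokens[i + 1]]  # bracket plus its label
            i += 2                            # skip the label as well
        elif tok == ")":
            if level <= depth:
                out.append(")")
            level -= 1
            i += 1
        else:
            i += 1                            # a word: not part of the template
    return " ".join(out)


def top_templates(parses, k=20, depth=3):
    """Return the k most frequent templates as (template, count) pairs."""
    return Counter(template(p, depth) for p in parses).most_common(k)


parses = [
    "(ROOT (S (NP (DT A) (NN person)) (VP (VBZ is) (VP (VBG running))) (. .)))",
    "(ROOT (S (NP (DT The) (NN dog)) (VP (VBD barked)) (. .)))",
    "(ROOT (S (VP (VB Stop)) (. .)))",
]
for tmpl, count in top_templates(parses, k=2):
    print(count, tmpl)
```

With a real corpus, `parses` would be the bracketed parser output for each sentence, and `k=20` would mirror the top-20 cutoff mentioned above.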

miyyer avatar Sep 05 '18 22:09 miyyer

Hi, just a friendly reminder, any update on the templates?

kj-lai avatar Nov 27 '18 04:11 kj-lai

Hi @miyyer @jwieting, just a friendly reminder, could you kindly share how the paranmt dataset is preprocessed (tokenizing, BPE, etc.)? Thanks

zhengliz avatar Jan 18 '19 23:01 zhengliz

I also want to know the BPE and tokenizing part.

LeeShiyang avatar Apr 01 '20 07:04 LeeShiyang

I also want to know about the templates!

santimarro avatar Apr 22 '20 13:04 santimarro