DualRL

pseudo-parallel data for GYAFC

Open bpucla opened this issue 5 years ago • 2 comments

Thank you for this great work!

It doesn't seem straightforward to apply the template-based method to the informal-formal dataset, since it has no clear attribute markers like those in the Yelp dataset. Could you please share more details on how you prepared the pseudo-parallel data for the informal-formal transfer task? I'd also really appreciate it if you could share a few examples of the pseudo pairs produced by the template-based method.

bpucla, Jun 24 '19 22:06

The templates used to generate the pseudo-parallel data are heuristic rules. For example, the templates (or rules) for informal-to-formal text transfer include:

  • Capitalize the first word and proper nouns. For example, i love it => I love it
  • Remove repeated punctuation. For example, wow!!!!! => wow
  • Expand acronyms and similar shorthand using a handcrafted list, etc.
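
The rules above can be sketched roughly as follows. This is only a minimal illustration, not the actual script used for DualRL: the function names, the tiny `ACRONYMS` table, and the choice to delete whole runs of repeated punctuation (mirroring the wow!!!!! => wow example) are all assumptions, and proper-noun capitalization is reduced to the pronoun "i" because real proper-noun detection would need a tagger.

```python
import re

# Hypothetical, illustrative expansion list; NOT the list used by the authors.
ACRONYMS = {"u": "you", "ur": "your", "btw": "by the way", "pls": "please"}


def informal_to_formal(sentence):
    """Apply three heuristic rules to turn an informal sentence into a pseudo-formal one."""
    # Rule: remove runs of repeated punctuation (assumed to be dropped entirely,
    # as in the example "wow!!!!! => wow").
    text = re.sub(r"([!?.,])\1+", "", sentence)
    # Rule: expand acronyms by whole-word lookup (tokens with attached
    # punctuation such as "u," are left unchanged in this sketch).
    words = [ACRONYMS.get(word.lower(), word) for word in text.split()]
    # Rule: capitalize the standalone pronoun "i" (a stand-in for proper nouns).
    words = ["I" if word == "i" else word for word in words]
    text = " ".join(words)
    # Rule: capitalize the first word of the sentence.
    if text:
        text = text[0].upper() + text[1:]
    return text


def make_pseudo_pairs(informal_sentences):
    """Pair each informal sentence with its rule-based formal rewrite."""
    return [(s, informal_to_formal(s)) for s in informal_sentences]


if __name__ == "__main__":
    for source, target in make_pseudo_pairs(["i love it", "pls call me, i am here!!!"]):
        print(source, "=>", target)
```

Pairs built this way are noisy, but they are only meant as a starting point for pre-training before the reinforcement learning stage.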

More details can be found in the original paper on the GYAFC dataset [1].

P.S. We also tried other methods of generating pseudo-parallel data for GYAFC, for example JS similarity and the method of Li et al., 2018. Although these methods are not perfect, they still provide a reasonable initialization for the model and a slight warm start for DualRL training, and the final results don't differ much.

[1] Sudha Rao and Joel R. Tetreault. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of NAACL, 2018.

luofuli, Jun 26 '19 14:06

Hi, thanks for the explanations. Could you also add the template-based outputs to the code base so that others can use them directly? These rules can be complex and miscellaneous, which makes replication very hard. Thanks!

jind11, Dec 22 '19 02:12