QUESTION: How to run on a different treebank corpus

Open loretoparisi opened this issue 8 years ago • 1 comments

How do I run this on a specific treebank text corpus, such as an Italian treebank corpus? Thank you.

loretoparisi avatar Apr 28 '16 05:04 loretoparisi

Before training on the corpus, you need to preprocess it the same way it is done for English. Here is what your corpus should look like:

  1. Single sentence per line.
  2. Lowercase.
  3. Strip punctuation (punctuation decreases perplexity, which skews the results, so I'd rather remove it).
  4. Replace singleton words (words that occur only once) with the <unk> label.
  5. Replace each digit with N or 0.
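
The five steps above could be sketched in Python roughly as follows. This is only an illustration, not the preprocessing script used for the English data: the function names `normalize` and `preprocess` are made up here, punctuation is assumed to be the ASCII set in `string.punctuation`, digits are mapped to `N`, and the input is assumed to already contain one sentence per line.

```python
import re
import string
from collections import Counter

# Assumption: only ASCII punctuation is stripped; extend this for your language
# (for example, typographic quotes or apostrophes in elided forms).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(line):
    """Lowercase a sentence, strip punctuation, replace each digit with N."""
    line = line.lower().translate(_PUNCT_TABLE)
    line = re.sub(r"\d", "N", line)
    return line.split()


def preprocess(sentences):
    """Return cleaned sentences with singleton words replaced by <unk>."""
    tokenized = [normalize(s) for s in sentences]
    # Drop sentences that became empty after cleaning.
    tokenized = [tokens for tokens in tokenized if tokens]
    # Count word frequencies over the whole input.
    counts = Counter(tok for tokens in tokenized for tok in tokens)
    return [
        " ".join(tok if counts[tok] > 1 else "<unk>" for tok in tokens)
        for tokens in tokenized
    ]
```

Note that <unk> is inserted after punctuation stripping, so its angle brackets are preserved in the output.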

Split your corpus into ptp.train.txt, ptp.test.txt and ptp.valid.txt, where ptp.test.txt and ptp.valid.txt are each about 5% of the size of ptp.train.txt.
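
A minimal sketch of such a split, assuming the preprocessed corpus is a list of sentence lines. The names `split_corpus` and `write_split`, the shuffling step, and the fixed seed are assumptions made for illustration; the held-out sets are sized relative to the training set, as described above.

```python
import random


def split_corpus(lines, fraction=0.05, seed=0):
    """Split lines into (train, valid, test); valid and test are each
    roughly `fraction` of the size of train."""
    lines = list(lines)
    # Assumption: sentences are shuffled before splitting; skip this
    # if you want to keep the original document order.
    random.Random(seed).shuffle(lines)
    # If held = fraction * train and train + 2 * held = total,
    # then held = total * fraction / (1 + 2 * fraction).
    n_held = round(len(lines) * fraction / (1 + 2 * fraction))
    test = lines[:n_held]
    valid = lines[n_held:2 * n_held]
    train = lines[2 * n_held:]
    return train, valid, test


def write_split(train, valid, test, prefix="ptp"):
    """Write the three splits to <prefix>.train.txt, .valid.txt, .test.txt."""
    for name, part in (("train", train), ("valid", valid), ("test", test)):
        with open(f"{prefix}.{name}.txt", "w", encoding="utf-8") as f:
            f.write("\n".join(part) + "\n")
```

For a corpus of 110 sentences this gives 100 training sentences and 5 each for validation and test.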

tastyminerals avatar Sep 01 '16 07:09 tastyminerals