lstm
QUESTION: How to run on a different treebank corpus
How can I run this on a specific treebank text corpus, such as an Italian treebank corpus? Thank you.
Before running on the corpus, you need to preprocess it the same way it is done for English. Here is what to do with your corpus:
- Single sentence per line.
- Lowercase.
- Strip punctuation (punctuation decreases perplexity, skewing the results, so I'd rather remove it).
- Replace singleton words with the `<unk>` label.
- Replace each digit with `N` or `0`.
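The steps above can be sketched in Python roughly as follows. This is only an illustration, not the preprocessing script used for the English data: the function names `normalize` and `preprocess` are hypothetical, "punctuation" is assumed to mean any character that is neither a word character nor whitespace, and a "singleton" is assumed to be a token that occurs exactly once in the whole input.

```python
import re
from collections import Counter


def normalize(line):
    """Lowercase one sentence, map digits to N, drop punctuation, and tokenize."""
    line = line.lower()
    # Replace every digit with the placeholder N (done after lowercasing,
    # so the placeholder itself stays uppercase).
    line = re.sub(r"\d", "N", line)
    # Assumption: anything that is not a word character or whitespace counts
    # as punctuation. It becomes a space, so "l'uomo" turns into "l uomo".
    line = re.sub(r"[^\w\s]", " ", line)
    return line.split()


def preprocess(sentences):
    """Return one cleaned sentence per item, with singleton words as <unk>."""
    tokenized = [normalize(s) for s in sentences]
    counts = Counter(tok for sent in tokenized for tok in sent)
    cleaned = []
    for sent in tokenized:
        if not sent:
            continue  # skip lines that are empty after cleaning
        cleaned.append(
            " ".join(tok if counts[tok] > 1 else "<unk>" for tok in sent)
        )
    return cleaned
```

For example, `preprocess(["The cat has 7 lives.", "the dog has 9 lives!"])` maps both sentences to `the <unk> has N lives`, because `cat` and `dog` each occur only once. Note that Python's `\w` is Unicode-aware for strings, so accented letters in an Italian corpus are kept.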
Split your training corpus into `ptp.train.txt`, `ptp.test.txt` and `ptp.valid.txt`, where `ptp.test.txt` and `ptp.valid.txt` are each 5% of the size of `ptp.train.txt`.
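One possible way to produce such a split is sketched below. The helper names `split_corpus` and `write_split` are hypothetical, and shuffling the sentences before splitting is an assumption on my part; pass `shuffle=False` to keep the original sentence order. The file names are the ones mentioned above.

```python
import random


def split_corpus(lines, held_out_ratio=0.05, shuffle=True, seed=0):
    """Split sentences into (train, valid, test).

    valid and test are each `held_out_ratio` of the *train* size, so each
    takes ratio / (1 + 2 * ratio) of the whole corpus.
    """
    lines = list(lines)
    if shuffle:
        # Assumption: sentence order does not matter; a fixed seed keeps
        # the split reproducible.
        random.Random(seed).shuffle(lines)
    n_held = round(len(lines) * held_out_ratio / (1 + 2 * held_out_ratio))
    valid = lines[:n_held]
    test = lines[n_held:2 * n_held]
    train = lines[2 * n_held:]
    return train, valid, test


def write_split(train, valid, test, prefix="ptp"):
    """Write the three parts as <prefix>.train.txt, .valid.txt and .test.txt."""
    for name, part in (("train", train), ("valid", valid), ("test", test)):
        with open(f"{prefix}.{name}.txt", "w", encoding="utf-8") as f:
            for sentence in part:
                f.write(sentence + "\n")
```

With 1100 preprocessed sentences this gives 1000 training sentences and 50 each for validation and test.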