James
> I have noticed that WikiExtractor misses the articles that include a colon in the title.
> For example, [Super Mario Advance 4: Super Mario Bros 3](https://en.wikipedia.org/wiki/Super_Mario_Advance_4:_Super_Mario_Bros._3) is currently ignored....
Seems like the variable RAM usage is due to not including `filtertoolong`. I'll check later to see whether the speed difference between 1.2 and 2.0 is still present.
Your dataset might be too tiny. How many lines? Tokenizing the Chinese side into single characters/words will not give good results. Try using HanLP and/or SentencePiece.
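For the SentencePiece route, something like this minimal sketch would do it (the corpus path, model prefix, and options are placeholders I made up, not taken from this thread):

```python
# A minimal sketch using SentencePiece's Python API; paths are hypothetical.
import sentencepiece as spm

# Train a subword model directly on the raw (untokenized) Chinese text.
spm.SentencePieceTrainer.train(
    input="corpus/train.zh",        # hypothetical training file
    model_prefix="corpus/spm/zh",   # hypothetical output prefix
    vocab_size=32000,
    character_coverage=0.9995,      # high coverage suits Chinese scripts
)

# Tokenize with the trained model so data and vocab stay consistent.
sp = spm.SentencePieceProcessor(model_file="corpus/spm/zh.model")
print(sp.encode("这是一个例子。", out_type=str))
```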
It's very likely to do with this:
```
src_vocab: corpus/spm/en-zh.vocab.en
tgt_vocab: corpus/spm/en-zh.vocab.zh
```
If the vocab was not built with the same tokenization method as your training files, this will...
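One quick way to spot such a mismatch (my own sketch, assuming the SentencePiece model file sits next to those vocab files) is to check that every vocab entry is actually a piece the model can produce:

```python
# Sanity check (a sketch, not from this thread): compare the vocab file's
# tokens against the pieces of the SentencePiece model that tokenized the data.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="corpus/spm/en-zh.model.zh")  # hypothetical path
spm_pieces = {sp.id_to_piece(i) for i in range(sp.get_piece_size())}

with open("corpus/spm/en-zh.vocab.zh", encoding="utf-8") as f:
    vocab_tokens = [line.rstrip("\n").split("\t")[0] for line in f]

missing = [t for t in vocab_tokens if t not in spm_pieces]
print(f"{len(missing)} vocab tokens are not produced by the SentencePiece model")
```

If that count is large, the vocab and the training data were tokenized differently.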
The vocab size is 32000.