James
> I have noticed that WikiExtractor misses the articles that include a colon in the title.
> For example, [Super Mario Advance 4: Super Mario Bros 3](https://en.wikipedia.org/wiki/Super_Mario_Advance_4:_Super_Mario_Bros._3) is currently ignored....
Seems like the variable RAM usage is due to not including `filtertoolong`. I'll check later to see whether the speed difference between 1.2 and 2.0 is still present.
Your dataset might be too tiny. How many lines? Tokenizing the Chinese side into single characters/words will not give good results. Try using HanLP and/or SentencePiece.
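For the SentencePiece route, something like this minimal sketch would do it (the corpus path, model prefix, and options are placeholders I made up, not taken from this thread):

```python
# A minimal sketch using SentencePiece's Python API; paths are hypothetical.
import sentencepiece as spm

# Train a subword model directly on the raw (untokenized) Chinese text.
spm.SentencePieceTrainer.train(
    input="corpus/train.zh",        # hypothetical training file
    model_prefix="corpus/spm/zh",   # hypothetical output prefix
    vocab_size=32000,
    character_coverage=0.9995,      # high coverage suits Chinese scripts
)

# Tokenize with the trained model so data and vocab stay consistent.
sp = spm.SentencePieceProcessor(model_file="corpus/spm/zh.model")
print(sp.encode("这是一个例子。", out_type=str))
```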
It's very likely to do with this:
```
src_vocab: corpus/spm/en-zh.vocab.en
tgt_vocab: corpus/spm/en-zh.vocab.zh
```
If the vocab was not built with the same tokenization method as your training files, this will...
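One quick way to spot such a mismatch (my own sketch, assuming the SentencePiece model file sits next to those vocab files) is to check that every vocab entry is actually a piece the model can produce:

```python
# Sanity check (a sketch, not from this thread): compare the vocab file's
# tokens against the pieces of the SentencePiece model that tokenized the data.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="corpus/spm/en-zh.model.zh")  # hypothetical path
spm_pieces = {sp.id_to_piece(i) for i in range(sp.get_piece_size())}

with open("corpus/spm/en-zh.vocab.zh", encoding="utf-8") as f:
    vocab_tokens = [line.rstrip("\n").split("\t")[0] for line in f]

missing = [t for t in vocab_tokens if t not in spm_pieces]
print(f"{len(missing)} vocab tokens are not produced by the SentencePiece model")
```

If that count is large, the vocab and the training data were tokenized differently.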
The vocab size is 32000.