Wikia-and-Wikipedia-EL-Dataset-Creator
wikiextractor bug
Thank you for releasing a useful dataset!
I also created a wikification dataset from Japanese Wikipedia and found two bugs in wikiextractor.
First, articles whose titles contain a colon, such as 未来日記-ANOTHER:WORLD-, are ignored. Second, some articles get the wrong page id; e.g., the page_id of 華麗なるファンタジア should be 3688400, but it comes out as 3688399.
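One way to check both symptoms against the raw dump is sketched below. It assumes the usual MediaWiki XML export layout, where each `<page>` has `<title>`, `<ns>`, and `<id>` as direct children and the revision id is nested inside `<revision>`. The function name `iter_main_namespace_pages`, the sample XML, and the ids in it are placeholders for illustration, not values from the real dump, and this is not the fix submitted to wikiextractor.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; the ids here are placeholders, not real page ids.
SAMPLE_DUMP = """<mediawiki>
  <page>
    <title>未来日記-ANOTHER:WORLD-</title>
    <ns>0</ns>
    <id>100</id>
    <revision><id>999</id></revision>
  </page>
  <page>
    <title>Wikipedia:Help</title>
    <ns>4</ns>
    <id>101</id>
    <revision><id>1000</id></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that real dumps put on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_main_namespace_pages(stream):
    """Yield (title, page_id) for every page whose <ns> is 0.

    The decision uses <ns>, not the presence of a colon in the title,
    and only direct children of <page> are read, so the nested
    revision <id> is never mistaken for the page id.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        for child in elem:
            name = local_name(child.tag)
            if name in ("title", "ns", "id") and name not in fields:
                fields[name] = child.text
        if fields.get("ns") == "0":
            yield fields["title"], int(fields["id"])
        elem.clear()  # free memory when streaming a large dump


pages = list(iter_main_namespace_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
```

Comparing the (title, page id) pairs produced this way with the extractor's output should show which colon-titled articles were dropped and which ids differ.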
Does this happen in your dataset, too?
If so, I can share my fixed code if you need it. I sent pull requests to wikiextractor, but they haven't been merged yet.
By the way, may I write issues in Japanese?
Hi, thanks for your interest! I just added a preprocessed dataset from ja-wiki; please check it out. If I have time, I would like to create a dataset with a version of wikiextractor that includes your pull requests. Currently, page ids are not dumped to the final dataset, but I'll check later.
Feel free to write in Japanese if you want.
Thank you for sharing the preprocessed data!
I confirmed that doc_title2sents in preprocessed_jawiki.zip doesn't contain any articles with a colon in the title.
I'll inform you when my PR is merged. Thanks.
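For anyone who wants to repeat the check, here is a rough sketch. It assumes doc_title2sents can be loaded as a mapping from article title to a list of sentences; the actual file format inside preprocessed_jawiki.zip is not stated in this thread, so the sample dictionary and the helper name `titles_with_colon` are hypothetical.

```python
def titles_with_colon(doc_title2sents):
    """Return, sorted, every title that contains an ASCII colon."""
    return sorted(title for title in doc_title2sents if ":" in title)


# Hypothetical stand-in for the loaded mapping (title -> sentences).
sample = {
    "華麗なるファンタジア": ["..."],
    "未来日記-ANOTHER:WORLD-": ["..."],
}

found = titles_with_colon(sample)
missing_colon_titles = len(titles_with_colon({"華麗なるファンタジア": ["..."]})) == 0
```

An empty result on the real mapping would match the observation above that colon-titled articles are absent.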