Wikia-and-Wikipedia-EL-Dataset-Creator

wikiextractor bug

Open ujiuji1259 opened this issue 3 years ago • 2 comments

Thank you for releasing a useful dataset!

I also created a wikification dataset from Japanese Wikipedia and found two bugs in wikiextractor. First, articles whose titles include a colon, such as 未来日記-ANOTHER:WORLD-, are ignored. Second, some articles get the wrong page ID, e.g. the page_id of 華麗なるファンタジア should be 3688400 but comes out as 3688399. Does this happen in your dataset, too?

If so, I can share my fixed code if you need it! I sent pull requests to wikiextractor, but they haven't been merged yet.
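To illustrate the first bug, here is a minimal sketch of how a title filter could end up dropping colon titles, and one way to avoid it. This is an assumption about the cause, not wikiextractor's actual code: the function names and the `KNOWN_NAMESPACES` set are hypothetical and only for illustration.

```python
# Hypothetical sketch; NOT taken from wikiextractor.
# Assumption: the extractor decides whether a page is an article by looking
# at the text before the first colon in the title.

# Illustrative subset of namespace prefixes (hypothetical, not exhaustive).
KNOWN_NAMESPACES = {"Wikipedia", "Template", "Category", "File", "Help", "Portal"}


def naive_keep(title: str) -> bool:
    """Naive filter: any title with a colon is treated as namespaced and skipped."""
    return ":" not in title


def is_article_title(title: str, namespaces=KNOWN_NAMESPACES) -> bool:
    """Skip a title only when the text before the first colon is a known namespace."""
    prefix, sep, _ = title.partition(":")
    if not sep:
        # No colon at all: an ordinary article title.
        return True
    return prefix.strip() not in namespaces


title = "未来日記-ANOTHER:WORLD-"
print(naive_keep(title))        # False: the article is wrongly dropped
print(is_article_title(title))  # True: "未来日記-ANOTHER" is not a namespace
```

With the second function, pages such as `Category:アニメ` are still skipped, while ordinary articles that merely contain a colon are kept.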

By the way, can I write issues in Japanese?

ujiuji1259, May 13 '21 18:05

Hi, thanks for your interest! I just added a preprocessed dataset built from ja-wiki. Please check it out. If I have time, I would like to create a dataset with a version of wikiextractor that includes your pull requests. Currently, page IDs are not dumped to the final dataset, but I'll check later.

It's fine to write in Japanese if you want.

izuna385, May 15 '21 12:05

Thank you for sharing the preprocessed data!

I confirmed that doc_title2sents in preprocessed_jawiki.zip didn't contain any articles that include a colon in the title. I'll inform you when my PR is merged. Thanks.
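For anyone who wants to repeat this check, a minimal sketch follows. It assumes `doc_title2sents` has already been loaded as a dict that maps article titles to lists of sentences; the actual file layout inside preprocessed_jawiki.zip is not described in this thread, so the loading step is left out, and `titles_with_colon` is a hypothetical helper name.

```python
# Hypothetical helper; assumes doc_title2sents is a dict of
# {article title: list of sentences} that has already been loaded.
def titles_with_colon(doc_title2sents):
    """Return the titles that contain an ASCII colon, sorted."""
    return sorted(title for title in doc_title2sents if ":" in title)


# Toy data for illustration only, not taken from the real dataset.
sample = {
    "未来日記-ANOTHER:WORLD-": ["sentence 1"],
    "華麗なるファンタジア": ["sentence 2"],
}
print(titles_with_colon(sample))  # ['未来日記-ANOTHER:WORLD-']
```

An empty result on the real data would match what I observed: no colon titles survived extraction.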

ujiuji1259, May 17 '21 04:05