esalib
Tutorial steps for importing a Wikipedia dump
Hi, I'm trying to follow the tutorial steps for updating the imported Wikipedia snapshot and the related data structures on a Linux workstation with MySQL, but I'm running into some trouble. My understanding is that PrepareWikiDb should be called several times during the import with different arguments (1=>drop_indexes, 2=>clear, 3=>rebuild_indexes, 4=>page_concepts...) in order to update the SQL database. Is that correct? Could you state the correct order? Thank you!
As usual when importing Wikipedia dumps, I run into problems like:
ERROR 1366 (HY000) at line 32: Incorrect string value: '\xD0' for column 'll_title' at row 2577
This happens even though I set the CHARACTER SET and COLLATE to utf8_general_ci. Am I the only one getting this error? For now I have switched to an importer that can ignore the very few errors of this kind and continue importing the rest of the dataset.
MySQL Community Server 5.6.21 (GPL), Ubuntu 14.04.1 LTS, Java 1.7.0_72 (64-bit)
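For reference, a minimal sketch of the kind of charset fix meant above, assuming MySQL 5.6 and the MediaWiki langlinks table (where the ll_title column from the error lives); the exact table definition depends on scripts/mediawiki.sql, so treat these statements as a starting point rather than a tested recipe:

```sql
-- Make sure the import connection itself talks UTF-8 to the server.
SET NAMES utf8;

-- Convert the table that rejects the '\xD0...' byte sequences to UTF-8.
ALTER TABLE langlinks CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Alternatively, recent MediaWiki schemas store titles as raw bytes, which
-- sidesteps the character-set check during the import entirely.
ALTER TABLE langlinks MODIFY ll_title VARBINARY(255) NOT NULL DEFAULT '';
```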
It runs sequentially, so if you specify 0, it should run all steps.
Unfortunately I'm unable to help you with the encoding problem. It seems the problem lies in the way you run the import. To solve it you really need to debug it carefully yourself, or get help on a site like stackoverflow.com and describe the encoding problem in greater detail.
Thank you for your support. One more quick question: I'm trying to complete the PrepareWikiDB procedure on a recent Wikipedia dump. It has now been stuck at step 8, "redirect_mapping", after 20 days of computation on an i7 with an SSD disk. The redirect_mapping table is quite small and it doesn't grow, even though MySQL keeps the CPU at 100%. Do you have any experience of how long this step usually takes? Thank you again!
-rw-rw---- 1 mysql mysql     8612 Oct 21 16:34 redirect_mapping.frm
-rw-rw---- 1 mysql mysql 14680064 Nov 11 17:01 redirect_mapping.ib
I'm sorry, I can't remember. It seems there might be a problem with the indexes on the tables. You should remove all indexes and build them afterwards.
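A rough illustration of that suggestion, with hypothetical index and column names (the real ones are defined by scripts/mediawiki.sql, so check SHOW INDEX first):

```sql
-- List the indexes that currently exist on the table the stuck step writes to.
SHOW INDEX FROM redirect_mapping;

-- Drop them before the bulk step (the index name 'source_idx' is hypothetical).
ALTER TABLE redirect_mapping DROP INDEX source_idx;

-- ... re-run the redirect_mapping step ...

-- Re-create the index once the step has finished
-- (the column name 'source' is hypothetical as well).
ALTER TABLE redirect_mapping ADD INDEX source_idx (source);
```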
Hi, the indices do not work if the datatypes are different. You have to alter the script https://github.com/ticcky/esalib/blob/master/scripts/mediawiki.sql to match the new Wikimedia schema (e.g. the new page table: http://www.mediawiki.org/wiki/Manual:Page_table). My script for reference: http://hastebin.com/weyoqawemu.sql
Step 8 fails because page_title in the old page table schema is varchar(255), while in the new one it is varbinary(255). No index will be used if you try to compare a varchar(255) column with a varbinary(255) one.
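A hedged sketch of the datatype fix described above; it assumes the local page table was created with the old mediawiki.sql schema and only shows the page_title column, so take the full definition from the Manual:Page_table link rather than from here:

```sql
-- Align the local column with the current dump format so index lookups work
-- again (varbinary, as in the new MediaWiki page table)...
ALTER TABLE page MODIFY page_title VARBINARY(255) NOT NULL DEFAULT '';

-- ...or, going the other way, keep the old varchar schema and convert the
-- imported column instead; what matters is that both sides of the comparison
-- use the same type.
-- ALTER TABLE page MODIFY page_title VARCHAR(255) CHARACTER SET utf8 NOT NULL DEFAULT '';
```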
PrepareWikiDB.java finished in under an hour on my machine (i5, 4 GB RAM).