
tutorial steps for importing wikipedia dump

Open fabiogasparetti opened this issue 10 years ago • 4 comments

Hi, I'm trying to follow the tutorial steps for updating the imported Wikipedia snapshot and related data structures on a Linux workstation with MySQL, but I'm having some trouble. My understanding is that PrepareWikiDb should be called several times during the import with a different argument each time (1 => drop_indexes, 2 => clear, 3 => rebuild_indexes, 4 => page_concepts, ...) in order to update the SQL db. Is that correct? Could you state the correct order? Thank you!

As usual when I'm trying to import Wiki dumps I end up into problems like:

ERROR 1366 (HY000) at line 32: Incorrect string value: '\xD0' for column 'll_title' at row 2577

This happens even though I set the CHARACTER SET with COLLATE utf8_general_ci. Am I the only person getting this error? I turned to an importer that can ignore the very few errors of this kind and continue importing the rest of the dataset.
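
A sketch of the kind of fix that usually resolves this, assuming the standard MediaWiki langlinks schema (where ll_title lives); verify the table name and connection settings against your own dump before running anything like it:

-- Convert the whole table, not just its default collation;
-- changing the default does not touch existing columns.
ALTER TABLE langlinks CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

-- Make sure the importing connection also speaks utf8,
-- otherwise the dump's bytes get reinterpreted on the way in.
SET NAMES utf8;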

MySQL Community Server (GPL) 5.6.21, Ubuntu 14.04.1 LTS, Java 1.7.0_72 (64-bit)

fabiogasparetti avatar Oct 20 '14 09:10 fabiogasparetti

It runs sequentially, so if you specify 0, it should run all steps.

Unfortunately I'm unable to help you with the encoding problem. It seems the problem lies in the way you run the import. To solve it you really need to debug it carefully yourself, or get help on a site like stackoverflow.com and describe the encoding problem in greater detail.

ticcky avatar Oct 20 '14 10:10 ticcky

Thank you for your support. One more quick question: I'm trying to complete the PrepareWikiDB procedure on a recent Wikipedia dump. It's now stuck at step 8, "redirect_mapping", after 20 days of computation on an i7 with an SSD disk. The redirect_mapping table is quite small and it doesn't grow, even though MySQL is keeping the CPU at 100%. Do you have any experience with how long this step usually takes? Thank you again!

-rw-rw---- 1 mysql mysql     8612 Oct 21 16:34 redirect_mapping.frm
-rw-rw---- 1 mysql mysql 14680064 Nov 11 17:01 redirect_mapping.ib

fabiogasparetti avatar Nov 11 '14 16:11 fabiogasparetti

I'm sorry, I can't remember. It seems there might be a problem with the indexes on the tables. You should remove all indexes and build them afterwards.
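
A sketch of that approach on the table from the listing above; the index name idx_title and its column are hypothetical, so read the real definitions off SHOW CREATE TABLE first:

-- Check what statement the stuck step is actually executing
SHOW FULL PROCESSLIST;

-- Inspect the indexes currently defined on the table
SHOW CREATE TABLE redirect_mapping;

-- Drop the secondary index (hypothetical name), run the slow step,
-- then rebuild the index afterwards
ALTER TABLE redirect_mapping DROP INDEX idx_title;
-- ... run the import step here ...
ALTER TABLE redirect_mapping ADD INDEX idx_title (title);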

ticcky avatar Nov 11 '14 16:11 ticcky

Hi, the indices do not work if the datatypes are different. You have to alter the script https://github.com/ticcky/esalib/blob/master/scripts/mediawiki.sql to match the new MediaWiki schema (e.g. the new page table: http://www.mediawiki.org/wiki/Manual:Page_table). My script for reference: http://hastebin.com/weyoqawemu.sql

Step 8 fails because page_title in the old page-table schema is varchar(255), while in the new one it is varbinary(255). No index will be used if you try to compare a varchar(255) column with a varbinary(255) one.
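
As a sketch of the corresponding change to mediawiki.sql (the column definition is taken from the Manual:Page_table page linked above; double-check it against the dump you are actually importing):

-- Old script defined: page_title varchar(255)
-- Align the type with the new dump schema so comparisons can use the index:
ALTER TABLE page MODIFY page_title VARBINARY(255) NOT NULL DEFAULT '';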

PrepareWikiDB.java finished in under an hour on my machine (i5, 4 GB RAM).

bernhardreisenberger avatar Dec 23 '14 09:12 bernhardreisenberger