mediawiki-xml2sql
mediawiki-xml2sql copied to clipboard
Missing delimiters and NULL placeholders
xml2sql output is faulty for missing columns
If I run xml2sql i get the following output for revision:
$ head -n 6 revision.sql
-- xml2sql - MediaWiki XML to SQL converter
-- Table revision for PostgreSQL
COPY "revision" FROM STDIN;
287059 2 287059 roboti Nyongeza: [[tpi:Akiolosi]] 988 LaaknorBot 2009-08-19T03:20:58Z 1 0
32885 10 32885 +jamii 160 Baba Tabita 2007-02-02T10:14:01Z 1 0
If i try to import this i get the following error:
psql -h localhost -d test2 -U babilen -W < revision.sql
Password for user babilen:
ERROR: missing data for column "rev_len"
CONTEXT: COPY revision, line 1: "287059 2 287059 roboti Nyongeza: [[tpi:Akiolosi]] 988 LaaknorBot 20090819032058 1 0"
The issue can be fixed if the following awk oneliner is applied to the txt files (or lines within revision.sql that hold values)
awk -v e=10 'BEGIN { FS = OFS = "\t"; e++ } NF > e { exit(1) } { t = ""; for (a = 0; a < e - NF; a++) t = t "\t\\N"; printf "%s%s\n", $0, t }'
The difference in output is: 287059 2 287059 roboti Nyongeza: [[tpi:Akiolosi]] 988 LaaknorBot 2009-08-19T03:20:58Z 1 0 287059 2 287059 roboti Nyongeza: [[tpi:Akiolosi]] 988 LaaknorBot 2009-08-19T03:20:58Z 1 0 \N \N
The \N are placeholders for NULL values which is OK since the respective columns (rev_len and rev_parent_id) are defined as nullable:
CREATE TABLE revision (
....
rev_len INTEGER NULL,
rev_parent_id INTEGER NULL
);
thanks
babilen
another possibility to solve this is to integrate the modification into xml2sql instead of postprocessing its output. The patch can be found here: http://meta.wikimedia.org/wiki/Xml2sql#patch_for_recent_versions_of_mw_.28.3E.3D1.10.29
Correct. I have forked xml2sql and fixed the issues I reported here. You might want to take a look at:
http://github.com/babilen/mediawiki-xml2sql