openlibrary icon indicating copy to clipboard operation
openlibrary copied to clipboard

fix(data): Correct wikidata dump format to raw TSV

Open KrishnaShuk opened this issue 1 month ago • 6 comments

Closes #11439

[fix] Corrects the format of the Wikidata data dump to make it easily parsable with standard command-line tools.

Technical

The dump-wikidata.sql script was using COPY ... WITH (FORMAT csv), which has strict quoting rules. This caused the JSON data field to be wrapped in unnecessary double quotes, breaking parsers like jq.

This change updates the script to use a plain COPY ... TO STDOUT command. This exports the data in PostgreSQL's default raw, tab-separated format, which does not add the extra quotes around the JSON field.

This fix aligns the Wikidata dump with the format of other Open Library dumps and resolves the reported parsing issue.

Testing

I have verified this fix locally by performing the following steps:

  1. Populated the local wikidata table by adding a Wikidata ID (e.g., Q42) to an author.
  2. Ran the original command (... WITH (FORMAT csv)) and confirmed it produced the incorrect, double-quoted JSON output.
  3. Ran the new, corrected command (COPY wikidata TO STDOUT;) and confirmed it produced a clean, raw TSV output without the extra quotes.

The fix behaves exactly as expected.

Follow-up Enhancements

  1. Reordering the columns to place the large JSON field last.
  2. Trimming the microsecond precision from the timestamp.

Stakeholders

@cdrini @tfmorris

KrishnaShuk avatar Nov 14 '25 05:11 KrishnaShuk

I've replicated the error and have identified the root cause. The problem is a systemic double-escaping of quotes (\") within the json data itself.

jq: parse error: Invalid numeric literal at line 1, column 129522

The above error is shown because of this line ->

"Entity[\"Person\", \"DouglasAdams::g

in the json data (if you are using Qid 48 as data source).

I confirmed this is not an isolated case by fetching several other Wikidata records (Q410, etc.), and they all contain the same error.

The solution is to clean the data on the fly during the dump. I plan to modify scripts/oldump.sh by adding a sed command to the pipeline, which corrects the escaping before the data is compressed:

... psql ... | sed 's/\\\\"/\\"/g' | gzip ...

I have verified this fix locally on a multi-record dump, and the output now parses perfectly with jq. If this approach sounds good, I can proceed with updating the PR.

KrishnaShuk avatar Nov 15 '25 08:11 KrishnaShuk

Hmmm, I would have thought the best solution would be to fix the stored data so that the column contains pure JSON, but I don't know how it's consumed (or produced).

Your sed based solution sounds like a good workaround in the mean time. Thanks for investigating and coming up with it.

p.s. Don't forget about the column ordering, please. I think the column names can just be added to the COPY command.

tfmorris avatar Nov 15 '25 19:11 tfmorris

thanks for the review. You're right about fixing the data at the source being the ideal long-term solution. For this PR, I'm using the sed workaround as a step forward for making the dump immediately usable.

I've now updated dump-wikidata.sql to export the columns in the order (id, updated, data), which places the large JSON field last as you suggested.

The changes have been pushed to the PR. Please let me know if there's anything else.

KrishnaShuk avatar Nov 15 '25 20:11 KrishnaShuk

That sounds good to me, but you'll need to get a review from someone with commit privileges (ie not me).

If you haven't done so already, I'd suggest updating your gist so that they can see example output. Github seems to think that there's "illegal quoting in line 1" for that gist.

tfmorris avatar Nov 20 '25 17:11 tfmorris

thanks! i have updated gist files. @cdrini, this should now be ready for a final review when you have a moment.

KrishnaShuk avatar Nov 20 '25 17:11 KrishnaShuk

Hi @cdrini, just a friendly bump on this. tfmorris reviewed the approach. It should be ready for a final look and merge. Thanks!

KrishnaShuk avatar Dec 06 '25 06:12 KrishnaShuk