import-wikidata-dump-to-couchdb
A tool to transfer an extract of a Wikidata dump into a CouchDB database
Summary
- Dependency
- Installation
- How to
  - Download dump
  - Extract subset
  - Import
    - Specify start and end line numbers
    - Behavior on conflict
- See also
- License
Dependency
- NodeJS >= v6. If your distribution doesn't provide a recent version of NodeJS, you might want to uninstall NodeJS and reinstall it using NVM
Installation
git clone https://github.com/maxlath/import-wikidata-dump-to-couchdb
cd import-wikidata-dump-to-couchdb
npm install
Now you can customize ./config/default.js to your needs.
How to
Download dump
Download the latest Wikidata dump.
Extract subset
Extract the subset of the dump fitting your needs, as you might not want to throw ~40 GB at your database's face.
For instance, for the needs of the authors-birthday bot, I wanted to keep only Wikidata entities of writers.
As each line of the dump is an entity, you could do something like this with grep:
cat dump.json | grep '36180\,' > isWriter.json
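Here the trick is that every entity with occupation->writer (P106->Q36180) will have 36180 somewhere in the line (as a claim numeric-id). And tadaa, you went from a 39 GB dump to a way nicer 384 MB subset. The match is crude, though: any line that contains 36180 followed by a comma for another reason is kept too.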
But now, we can do something cleaner using wikidata-filter:
cat dump.json | wikidata-filter --claim P106:Q36180 > isWriter.json
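To make the difference with the grep trick concrete, here is a minimal sketch of what a claim check on a parsed entity could look like. It is not wikidata-filter's actual code: hasClaim and the sample line are hypothetical, and the line is a heavily trimmed stand-in for a real dump line.

```javascript
// Hypothetical helper, not wikidata-filter's implementation.
// Returns true if `entity` has at least one `property` claim whose value is the item `itemId`.
function hasClaim (entity, property, itemId) {
  const propertyClaims = (entity.claims && entity.claims[property]) || []
  return propertyClaims.some(claim => {
    // Claims with snaktype "novalue" or "somevalue" carry no datavalue
    const datavalue = claim.mainsnak && claim.mainsnak.datavalue
    const value = datavalue && datavalue.value
    if (!value || typeof value !== 'object') return false
    // Match on either the full id or the numeric-id mentioned above
    return value.id === itemId || `Q${value['numeric-id']}` === itemId
  })
}

// A trimmed-down stand-in for one dump line (real lines also carry labels, sitelinks, etc.)
const sampleLine = '{"id":"Q535","claims":{"P106":[{"mainsnak":{"snaktype":"value","property":"P106","datavalue":{"value":{"entity-type":"item","numeric-id":36180,"id":"Q36180"},"type":"wikibase-entityid"}}}]}},'

// Drop the trailing comma before parsing, then test the claim
const sampleEntity = JSON.parse(sampleLine.replace(/,\s*$/, ''))
console.log(hasClaim(sampleEntity, 'P106', 'Q36180'))
console.log(hasClaim(sampleEntity, 'P106', 'Q5'))
```

Unlike the grep version, this only matches when 36180 is the value of a P106 claim, not when the number shows up anywhere else in the line.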
Import
This new file isn't valid JSON (it's line-delimited JSON), but every line is, once you remove the comma at the end of the line. So here is the plan: take every line, remove the comma, PUT it in your database:
./import.js ./isWriter.json
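The plan can be sketched roughly as follows. This is not the actual import.js code: parseDumpLine and buildPutRequest are hypothetical helpers, the database URL is a placeholder, and reusing the Wikidata id as the CouchDB document _id is an assumption.

```javascript
// Hypothetical helpers sketching the plan; not the actual import.js code.

// Parse one dump line into an entity, or return null for lines that hold no entity
// (blank lines and the "[" / "]" array delimiters found in a full dump).
function parseDumpLine (line) {
  const trimmed = line.trim().replace(/,$/, '')
  if (trimmed === '' || trimmed === '[' || trimmed === ']') return null
  return JSON.parse(trimmed)
}

// Describe the PUT request that would store the entity.
// Assumption: the Wikidata id (e.g. "Q42") is reused as the CouchDB document _id.
function buildPutRequest (dbUrl, entity) {
  const doc = Object.assign({ _id: entity.id }, entity)
  return {
    method: 'PUT',
    url: `${dbUrl.replace(/\/$/, '')}/${encodeURIComponent(entity.id)}`,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc)
  }
}

// Placeholder database URL: 5984 is CouchDB's default port, "wikidata" is a made-up name
const request = buildPutRequest(
  'http://localhost:5984/wikidata',
  parseDumpLine('{"id":"Q42","type":"item"},')
)
console.log(request.method, request.url)
```

The sketch only builds a description of the request, so the parsing step can be checked without a running CouchDB; sending it is left to whatever HTTP client you prefer.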
Specify start and end line numbers:
startline=5
# line 10 will be included
endline=10
./import.js ./isWriter.json $startline $endline
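As an illustration of the inclusive end line, here is a small sketch of how such a range could be applied. selectLines is a hypothetical helper, and two points are assumptions the text does not confirm: line numbers are 1-based, and the start line is included as well.

```javascript
// Hypothetical helper: keep only the lines between startLine and endLine.
// Assumptions: line numbers are 1-based and both bounds are inclusive
// (the text above only states that the end line is included).
function selectLines (lines, startLine, endLine) {
  const start = startLine || 1
  const end = endLine || lines.length
  return lines.filter((line, index) => {
    const lineNumber = index + 1
    return lineNumber >= start && lineNumber <= end
  })
}

// Twelve fake lines, named after their own line number
const fakeLines = Array.from({ length: 12 }, (_, i) => `line ${i + 1}`)
const selected = selectLines(fakeLines, 5, 10)
console.log(selected.length, selected[0], selected[selected.length - 1])
```

With startline=5 and endline=10, this keeps six lines, from line 5 to line 10.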
Behavior on conflict
In the config file (./config/default.js), you can set the behavior on conflict, that is, when the importer tries to add an entity that was already added to CouchDB:
- update (default): update the document if there is a change, otherwise pass
- pass: always pass
- exit: exit the process at first conflict
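The three modes can be pictured with the sketch below. It is not the importer's actual code: resolveConflict and its return shape are hypothetical, and util.isDeepStrictEqual assumes a NodeJS version more recent than the v6 minimum (it appeared in v9). What is certain is the CouchDB side: an update is only accepted if it carries the current _rev of the stored document.

```javascript
// Hypothetical sketch of the three conflict modes; not the importer's actual code.
// Assumption: util.isDeepStrictEqual is available (NodeJS >= v9).
const { isDeepStrictEqual } = require('util')

// `existing` is the document already in CouchDB, `incoming` the entity read from the dump.
function resolveConflict (mode, existing, incoming) {
  if (mode === 'pass') return { action: 'pass' }
  if (mode === 'exit') return { action: 'exit' }
  if (mode === 'update') {
    // Compare while ignoring _rev, which only the stored document has
    const stored = Object.assign({}, existing)
    delete stored._rev
    const candidate = Object.assign({ _id: existing._id }, incoming)
    if (isDeepStrictEqual(stored, candidate)) return { action: 'pass' }
    // CouchDB only accepts an update that carries the current revision
    return { action: 'update', doc: Object.assign({}, candidate, { _rev: existing._rev }) }
  }
  throw new Error(`unknown conflict mode: ${mode}`)
}

// Made-up documents: the stored label differs from the one in the dump
const storedDoc = { _id: 'Q42', _rev: '1-abc', id: 'Q42', labels: { en: 'old label' } }
const dumpEntity = { id: 'Q42', labels: { en: 'new label' } }
console.log(resolveConflict('update', storedDoc, dumpEntity).action)
console.log(resolveConflict('pass', storedDoc, dumpEntity).action)
```

Returning a decision object instead of acting directly keeps the three modes easy to compare; the caller would then send the update, skip the entity, or stop the process.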
See also
- wikidata-filter: a command-line tool to filter a Wikidata dump by claim
- wikidata-subset-search-engine: tools to set up an ElasticSearch instance fed with subsets of Wikidata
- wikidata-sdk: a JavaScript tool-suite to query Wikidata and simplify its results
- wikidata-cli: read and edit Wikidata from the command line
License
MIT