sql2graph
Adjust the batchimport to the new features
Hi there, I imported the MusicBrainz database into Neo4j using the following approach, with help from @jexp:
Define two indexes in batch.properties, an exact index named mbid for MBIDs and a fulltext index named mb for everything else:
batch_import.keep_db=false
batch_import.mapdb_cache.disable=true
batch_import.node_index.mb=fulltext
batch_import.node_index.mbid=exact
batch_import.csv.quotes=false
cache_type=none
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=300M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=0M
neostore.propertystore.db.index.keys.mapped_memory=15M
neostore.propertystore.db.index.mapped_memory=15M
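As a quick sanity check on a configuration like the one above, here is a minimal Python sketch that lists the node indexes declared in a batch.properties text. The function name and the sample text are made up for illustration; the only thing taken from this thread is the batch_import.node_index.<name>=<type> key pattern.

```python
# Sketch only: extract the node indexes declared in a batch.properties text.
# The key prefix comes from the config above; everything else is illustrative.
PREFIX = "batch_import.node_index."


def declared_node_indexes(properties_text):
    """Return {index_name: index_type} for every node index line."""
    indexes = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.startswith(PREFIX):
            indexes[key[len(PREFIX):]] = value.strip()
    return indexes


# Shortened copy of the properties shown above.
SAMPLE = """\
batch_import.keep_db=false
batch_import.node_index.mb=fulltext
batch_import.node_index.mbid=exact
cache_type=none
"""

if __name__ == "__main__":
    print(declared_node_indexes(SAMPLE))
```

The index names returned here (mb and mbid) are the ones the CSV headers have to refer to.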
Then, put the indexing instructions directly in the header row of the nodes.csv and rels.csv files, so we don't need the ...index.csv files anymore; see the "automatic indexing" section of https://github.com/jexp/batch-import
kind:string:mb comment status position name:string:mb area gender format barcode number ended length end_date_year begin_date_year mbid:string:mbid type:string:mb pk
artist Talkshow Boy f e8d94cf5-fafa-48fc-a6fa-aa50cf54d7f3 288762
artist Vibulator f 735bfaad-6eb1-4f9c-b21d-cbaef7c79a92 97944
artist Eat Me f c38a93e8-2ecf-4848-b1d2-364202d9dc0c Group 499198
artist Uffe Andersen f a7f3c871-3ba3-40b1-ba58-d08b40312789 Person 514886
artist Headust f eda60727-7036-437b-b53d-ae472818ee3a 212148
artist Sons Of The Subway f 232d5716-c2b2-47e1-aa0c-264ec69e6a18 100774
artist The Poe Boy Family f 672d599e-6a6c-456e-98ba-dac5a45e3ed8 43132
artist Ralph Gusovius Germany Male f 1950 6ecfcea1-677d-427b-a38b-9c76ce92e313 Person 295052
artist Elastik Band f 46e0639c-1ccf-45f5-b886-4cbf5549a2a1 61467
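To make the header convention concrete, here is a rough Python sketch that produces a nodes file in this shape. It assumes the file is tab-separated and unquoted (matching batch_import.csv.quotes=false), and that an indexed column is written as name:string:indexName as in the header above. The function names, and the way the sample row is mapped onto columns, are my own reading of the data and not something sql2graph does today.

```python
# Sketch only: write node rows with indexing instructions in the header.
# Column order is copied from the header shown above.
COLUMNS = [
    "kind", "comment", "status", "position", "name", "area", "gender",
    "format", "barcode", "number", "ended", "length", "end_date_year",
    "begin_date_year", "mbid", "type", "pk",
]

# Column -> index name, as declared in batch.properties.
INDEXED = {"kind": "mb", "name": "mb", "mbid": "mbid", "type": "mb"}


def header_cell(column):
    """Append ':string:<index>' to columns that should be indexed."""
    index = INDEXED.get(column)
    return f"{column}:string:{index}" if index else column


def clean(value):
    """Values are unquoted, so tabs and newlines must not appear in them."""
    if value is None:
        return ""
    return str(value).replace("\t", " ").replace("\n", " ")


def nodes_tsv(rows):
    """Return the whole file as one tab-separated string."""
    lines = ["\t".join(header_cell(c) for c in COLUMNS)]
    for row in rows:
        lines.append("\t".join(clean(row.get(c)) for c in COLUMNS))
    return "\n".join(lines) + "\n"


# One row from the sample above; the field assignment is an assumption.
ROW = {
    "kind": "artist",
    "name": "Eat Me",
    "ended": "f",
    "mbid": "c38a93e8-2ecf-4848-b1d2-364202d9dc0c",
    "type": "Group",
    "pk": 499198,
}

if __name__ == "__main__":
    print(nodes_tsv([ROW]), end="")
```

Missing properties are simply left as empty fields, which is why most rows in the sample look so sparse.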
And then import the two files with something like:
java -Xmx10G -server -Dfile.encoding=UTF-8 -jar ~/neo/batch-import/target/batch-import-jar-with-dependencies.jar ./graph.db nodes.csv rels.csv
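If the import is driven from a script, the same command can be assembled in Python. This is only a sketch: the helper name and the default heap size are placeholders, and it assumes java is on the PATH. One detail worth noting is that a leading ~ in the jar path is expanded by the shell, not by Java, so it has to be expanded explicitly when no shell is involved.

```python
import os
import shlex


def batch_import_command(jar, db_dir, nodes_csv, rels_csv, heap="10G"):
    """Build the argument list for the batch importer (hypothetical helper)."""
    return [
        "java",
        f"-Xmx{heap}",
        "-server",
        "-Dfile.encoding=UTF-8",
        "-jar",
        os.path.expanduser(jar),  # "~" is not expanded without a shell
        db_dir,
        nodes_csv,
        rels_csv,
    ]


if __name__ == "__main__":
    cmd = batch_import_command(
        "~/neo/batch-import/target/batch-import-jar-with-dependencies.jar",
        "./graph.db",
        "nodes.csv",
        "rels.csv",
    )
    # Print the command; to actually run it, pass cmd to
    # subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```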
WDYT? It would make the output a lot simpler, and the import took about 10 minutes on my machine for 160M properties and 75M relationships.
Thank you Peter,
It happens that I already started a branch, "multi_nodescsv", in this very direction: https://github.com/redapple/sql2graph/tree/multi_nodescsv. This branch also uses several separate CSV node files (another recent batchimport feature). These CSV files can be generated directly by the database engine (at least by PostgreSQL in the case of MusicBrainz).
The branch is not very clean yet. It does use automatic indexing for MusicBrainz, but in a different (and more naïve) way. By comparison, your single exact "mbid" index for all MBIDs is smart; in the branch I am using one index per entity (artists, labels...) and indexing "mbid" in each, which is definitely less elegant.
I should be back online in the coming days, so I can work on updating the "multi_nodescsv" branch with your ideas and merge it into master if we converge.
As for the processing times you have, I'm afraid I don't have that much RAM on my laptop or server :) (I'm running out of memory when importing too many entities). Great to hear you've been able to import all of MusicBrainz!
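As a rough illustration of letting PostgreSQL write such a file, the Python sketch below only builds a COPY statement whose column aliases double as the batch-import header. The table and column names (artist, name, gid, id) are assumptions about the MusicBrainz schema and may not match the version you have, and the helper names are made up. Note also that FORMAT csv will quote any value containing a tab, a quote, or a newline, which may need handling if the importer runs with quotes disabled.

```python
def quote_ident(name):
    """Double-quote an SQL identifier so that ':' is allowed in aliases."""
    return '"' + name.replace('"', '""') + '"'


def copy_statement(table, columns):
    """Build a COPY ... TO STDOUT statement.

    columns is a list of (sql_expression, csv_header) pairs.
    """
    select_list = ", ".join(
        f"{expr} AS {quote_ident(header)}" for expr, header in columns
    )
    return (
        f"COPY (SELECT {select_list} FROM {quote_ident(table)}) "
        "TO STDOUT WITH (FORMAT csv, HEADER true, DELIMITER E'\\t')"
    )


# Hypothetical mapping; the real schema may name or join these differently.
ARTIST_COLUMNS = [
    ("'artist'", "kind:string:mb"),
    ("name", "name:string:mb"),
    ("gid", "mbid:string:mbid"),
    ("id", "pk"),
]

if __name__ == "__main__":
    print(copy_statement("artist", ARTIST_COLUMNS))
```

The printed statement could then be fed to psql, with the output redirected to one nodes file per entity.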
Yes, let's connect, I am peter.neubauer on Skype!