sql2graph Adjust the batchimport to the new features

Hi there, I imported the MusicBrainz database into Neo4j using the following approach, with help from @jexp:

Define two indexes in batch.properties: an exact index named mbid for MBIDs, and a fulltext index named mb for everything else:

batch_import.keep_db=false
batch_import.mapdb_cache.disable=true
batch_import.node_index.mb=fulltext
batch_import.node_index.mbid=exact
batch_import.csv.quotes=false
cache_type=none
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=300M
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=500M
neostore.propertystore.db.strings.mapped_memory=500M
neostore.propertystore.db.arrays.mapped_memory=0M
neostore.propertystore.db.index.keys.mapped_memory=15M
neostore.propertystore.db.index.mapped_memory=15M

Then, put the indexing instructions directly in the header rows of the nodes.csv and rels.csv files, so we don't need the ...index.csv files anymore; see https://github.com/jexp/batch-import (automatic indexing):

kind:string:mb  comment status  position    name:string:mb  area    gender  format  barcode number  ended   length  end_date_year   begin_date_year mbid:string:mbid    type:string:mb  pk
artist              Talkshow Boy                        f               e8d94cf5-fafa-48fc-a6fa-aa50cf54d7f3        288762
artist              Vibulator                       f               735bfaad-6eb1-4f9c-b21d-cbaef7c79a92        97944
artist              Eat Me                      f               c38a93e8-2ecf-4848-b1d2-364202d9dc0c    Group   499198
artist              Uffe Andersen                       f               a7f3c871-3ba3-40b1-ba58-d08b40312789    Person  514886
artist              Headust                     f               eda60727-7036-437b-b53d-ae472818ee3a        212148
artist              Sons Of The Subway                      f               232d5716-c2b2-47e1-aa0c-264ec69e6a18        100774
artist              The Poe Boy Family                      f               672d599e-6a6c-456e-98ba-dac5a45e3ed8        43132
artist              Ralph Gusovius  Germany Male                f           1950    6ecfcea1-677d-427b-a38b-9c76ce92e313    Person  295052
artist              Elastik Band                        f               46e0639c-1ccf-45f5-b886-4cbf5549a2a1        61467
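For illustration, here is a minimal Python sketch of how such a header could be generated. It assumes the name:type:index header convention visible in the sample above; the INDEXED mapping and the helper names header_cell and nodes_tsv are hypothetical and not part of sql2graph or batch-import. It also assumes values contain no tabs or newlines, since quoting is disabled in batch.properties.

```python
# Hypothetical mapping: property name -> (type, index name),
# mirroring the annotated columns in the sample header above.
INDEXED = {
    "kind": ("string", "mb"),
    "name": ("string", "mb"),
    "type": ("string", "mb"),
    "mbid": ("string", "mbid"),
}


def header_cell(prop):
    """Return 'prop:type:index' for indexed properties, else just 'prop'."""
    if prop in INDEXED:
        prop_type, index_name = INDEXED[prop]
        return f"{prop}:{prop_type}:{index_name}"
    return prop


def nodes_tsv(columns, rows):
    """Build tab-separated text: an annotated header, then one line per node.

    Assumes no value contains a tab or newline (no quoting is applied).
    Missing properties are written as empty fields.
    """
    lines = ["\t".join(header_cell(c) for c in columns)]
    for row in rows:
        lines.append("\t".join(str(row.get(c, "")) for c in columns))
    return "\n".join(lines) + "\n"


columns = ["kind", "name", "ended", "mbid", "type", "pk"]
rows = [
    {
        "kind": "artist",
        "name": "Eat Me",
        "ended": "f",
        "mbid": "c38a93e8-2ecf-4848-b1d2-364202d9dc0c",
        "type": "Group",
        "pk": 499198,
    },
]
output = nodes_tsv(columns, rows)
print(output, end="")
```

Keeping the index annotations in one mapping means the same table drives every nodes file, which fits the idea of a single shared mbid index.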

And then import the two files with something like

java -Xmx10G -server -Dfile.encoding=UTF-8 -jar ~/neo/batch-import/target/batch-import-jar-with-dependencies.jar ./graph.db nodes.csv rels.csv

WDYT? It would make the output a lot simpler, and the import took about 10 minutes on my machine for 160M properties and 75M relationships ...

Aug 17 '13 18:08 peterneubauer

Thank you Peter,

It happens that I already started a branch, "multi_nodescsv", in this very direction: https://github.com/redapple/sql2graph/tree/multi_nodescsv This branch also uses separate nodes CSV files (another recent batch-import feature). These CSV files can be generated directly by the database engine (at least by PostgreSQL, in the case of MusicBrainz).

The branch is not very clean yet, but it already uses automatic indexing for MusicBrainz, though in a different (and more naïve) way. By comparison, your single "mbid" exact index for all MBIDs is smart; in the branch I am using one index per entity (artists, labels...) and indexing "mbid" in each, which is definitely less elegant.

I should be back online in the coming days, so I can work on updating the "multi_nodescsv" branch with your ideas and merge them into master if we converge.

As for your processing times, I'm afraid I don't have that much RAM on my laptop or server :) (I run out of memory when importing too many entities). Great to hear you've been able to import all of MusicBrainz!

Aug 17 '13 23:08 redapple

Yes, let's connect, I am peter.neubauer on Skype!

Aug 19 '13 09:08 peterneubauer
