neo4j-stackoverflow-import
Utilities for Importing Stack-Exchange Dumps into Neo4j
== Stackoverflow dump to CSV converter
See http://neo4j.com/blog/import-10m-stack-overflow-questions/[this blog post] for details.
Download the Stack-Exchange dump you're interested in from the https://archive.org/details/stackexchange[internet archive].
Unzip the files using 7zip on your system (you might need to install p7zip, e.g. with apt-get or Homebrew). You can optionally compress the individual XML files with gzip to save some space.
=== Convert a Stackoverflow Dump
These files and fields are currently handled out of the box:

- Posts("Id#id:ID(Post)", "Title#title", "PostTypeId#postType:INT", "CreationDate#createdAt:datetime", "Score#score:INT", "ViewCount#views:INT", "AnswerCount#answers:INT", "CommentCount#comments:INT", "FavoriteCount#favorites:INT", "LastEditDate#updatedAt:datetime")
- Posts also creates:
** PostsRels(":END_ID(Post)", ":START_ID(Post)") - all answers of a question
** PostsAnswers(":START_ID(Post)", ":END_ID(Post)") - the accepted answer
** UsersPosts(":START_ID(User)", ":END_ID(Post)") - the user who created this post
** TagsPosts(":START_ID(Post)", ":END_ID(Tag)") - one entry per tag
- Users("Id#id:ID(User)", "DisplayName#name", "Reputation#reputation:INT", "CreationDate#createdAt:datetime", "LastAccessDate#accessedAt:datetime", "WebsiteUrl#url", "Location#location", "Views#views:INT", "UpVotes#upvotes:INT", "DownVotes#downvotes:INT", "Age#age:INT", "AccountId#accountId:INT")
- Tags("TagName#name:ID(Tag)", "Count#count:INT", "WikiPostId#wikiPostId:INT")
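The specs above follow an `XmlAttribute#csvField:TYPE` convention: the part before the `#` is the XML attribute to read from the dump, the part after is the CSV header entry (property name plus optional type) that neo4j-import understands. A minimal Python sketch of that split; `parse_field_spec` is a hypothetical helper, not part of the converter:

```python
def parse_field_spec(spec):
    """Split a spec like "Id#id:ID(Post)" into the XML attribute name
    and the CSV header entry (property name plus optional type)."""
    xml_attr, sep, csv_header = spec.partition("#")
    if not sep:                  # no '#': the attribute name doubles as the header
        csv_header = xml_attr
    return xml_attr, csv_header

specs = ["Id#id:ID(Post)", "Title#title", "Score#score:INT",
         "CreationDate#createdAt:datetime"]
mapping = dict(parse_field_spec(s) for s in specs)
print(mapping["Id"])     # id:ID(Post)
print(mapping["Score"])  # score:INT
```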
The XML processing limits have to be raised, as the Stack Overflow dumps are larger than the JDK's default entity limits allow:

----
mvn compile exec:java -Dexec.mainClass=org.neo4j.example.so.StackOverflowConverter \
  -Dexec.args="/path/to/data" \
  -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0
----
This will create the following files in the data directory:

----
Posts.csv.gz          Posts_header.csv
PostsAnswers.csv.gz   PostsAnswers_header.csv
PostsRels.csv.gz      PostsRels_header.csv
Tags.csv.gz           Tags_header.csv
TagsPosts.csv.gz      TagsPosts_header.csv
Users.csv.gz          Users_header.csv
UsersPosts.csv.gz     UsersPosts_header.csv
----
=== Import
Import with:

----
export NEO=/path/to/neo
$NEO/bin/neo4j-import \
  --into stackoverflow.db \
  --id-type string \
  --bad-tolerance true --skip-bad-entries-logging true --skip-bad-relationships true --skip-duplicate-nodes true \
  --nodes:Post Posts_header.csv,Posts.csv.gz \
  --nodes:User Users_header.csv,Users.csv.gz \
  --nodes:Tag Tags_header.csv,Tags.csv.gz \
  --relationships:ANSWERED PostsRels_header.csv,PostsRels.csv.gz \
  --relationships:ACCEPTED PostsAnswers_header.csv,PostsAnswers.csv.gz \
  --relationships:TAGGED TagsPosts_header.csv,TagsPosts.csv.gz \
  --relationships:POSTED UsersPosts_header.csv,UsersPosts.csv.gz
----
Post-processing, e.g. with neo4j-shell or cypher-shell:

- set the `:Question` label on posts with `postType = 1`
- set the `:Answer` label on the accepted answers
----
call apoc.periodic.iterate("MATCH (p:Post) where p.postType = 1 return p", "SET p:Question",{batchSize:10000,iterateList:true,parallel:true});
call apoc.periodic.iterate("MATCH (q:Question)-[:ACCEPTED]->(a) RETURN a", "SET a:Answer",{batchSize:10000,iterateList:true,parallel:true});

create index on :Question(title);
create index on :Question(createdAt);
create index on :Question(score);
create index on :Question(views);
create index on :Question(answers);
create index on :Question(updatedAt);
create index on :Question(favorites);
create index on :Question(comments);
create index on :Tag(count);
create index on :User(name);
create index on :User(reputation);
create index on :User(createdAt);
create index on :User(views);
create index on :User(upvotes);
create index on :User(age);
create index on :User(accountId);
create index on :User(location);

create constraint on (t:Tag) assert t.name is unique;
create constraint on (u:User) assert u.id is unique;
create constraint on (p:Post) assert p.id is unique;
----
=== Internals
This script is run with individual files (e.g. Users.xml.gz) as parameters.
You can optionally provide a comma-separated list of keys to extract after a colon:
----
mvn compile exec:java -Dexec.mainClass=org.neo4j.example.so.XmlToCsvConverter \
  -Dexec.args="Users.xml.gz:Id,Reputation,CreationDate,DisplayName,LastAccessDate
Posts.xml.gz:Id,CreationDate,Score,OwnerUserId,Title,Tags,AnswerCount"
...
----
A quick sed script for extracting those keys from the first row is:
----
gzip -dc Users.xml.gz | head -3 | tail -1 | sed -e 's/="[^"]*" /,/g'
----
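The same extraction can be sketched in Python, assuming the standard Stack Exchange dump layout where each data row is a single `<row .../>` element whose fields are XML attributes:

```python
import re

# a typical data row from Users.xml (the third line of the file)
row = '<row Id="1" Reputation="101" CreationDate="2008-07-31T14:22:31.287" DisplayName="Jeff" />'

# pull out every attribute name, mirroring the sed one-liner above
keys = re.findall(r'(\w+)="[^"]*"', row)
print(",".join(keys))  # Id,Reputation,CreationDate,DisplayName
```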
If you don't provide the keys, it samples the first 100 rows to learn about the keys present and uses them.
The first few rows might have fewer columns than later ones, since new key-columns can still be discovered further into the file; unknown columns will be left empty.
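That sampling behaviour can be sketched like this (hypothetical `learn_keys` helper, again assuming the `<row .../>` attribute format; the actual converter is the Java class above):

```python
import itertools
import re

ATTR = re.compile(r'(\w+)="([^"]*)"')

def learn_keys(rows, sample=100):
    """Collect the union of attribute names seen in the first `sample` rows."""
    keys = []
    for row in itertools.islice(rows, sample):
        for key, _ in ATTR.findall(row):
            if key not in keys:
                keys.append(key)
    return keys

rows = ['<row Id="1" Name="a" />', '<row Id="2" Name="b" Age="30" />']
keys = learn_keys(rows)            # ['Id', 'Name', 'Age']

# a row seen before "Age" first appeared gets an empty value in that column
table = [[dict(ATTR.findall(r)).get(k, "") for k in keys] for r in rows]
print(table)  # [['1', 'a', ''], ['2', 'b', '30']]
```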
Per input file it will output a compressed CSV file with the data, e.g. Users.csv.gz, and a header file, e.g. Users_header.csv.
----
mvn compile exec:java -Dexec.mainClass=org.neo4j.example.so.XmlToCsvConverter -Dexec.args="Users.xml.gz:Id,Reputation,CreationDate,DisplayName,LastAccessDate Posts.xml.gz:Id,CreationDate,Score,OwnerUserId,Title,Tags,AnswerCount"

[INFO] ------------------------------------------------------------------------
[INFO] Building stackoverflow-import 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
Processing Users.xml.gz
Processing Posts.xml.gz
Done processing Users.xml.gz with 99990 rows in 2 seconds.
Done processing Posts.xml.gz with 152803 rows in 4 seconds.
----