neo4j-stackoverflow-import
Utilities for Importing Stack-Exchange Dumps into Neo4j
== Stackoverflow dump to CSV converter
See http://neo4j.com/blog/import-10m-stack-overflow-questions/[this blog post] for details.
Download the Stack-Exchange dump you're interested in from the https://archive.org/details/stackexchange[internet archive].
Unzip the files using 7zip on your system (you might need to install p7zip, e.g. with apt-get or Homebrew). You can optionally compress the individual XML files with gzip to save some space.
=== Convert a Stackoverflow Dump
These files and fields are currently handled out of the box:

- Posts("Id#id:ID(Post)", "Title#title", "PostTypeId#postType:INT", "CreationDate#createdAt:datetime", "Score#score:INT", "ViewCount#views:INT", "AnswerCount#answers:INT", "CommentCount#comments:INT", "FavoriteCount#favorites:INT", "LastEditDate#updatedAt:datetime")
- Posts also creates:
** PostsRels(":END_ID(Post)", ":START_ID(Post)") - all answers of a question
** PostsAnswers(":START_ID(Post)", ":END_ID(Post)") - the accepted answer
** UsersPosts(":START_ID(User)", ":END_ID(Post)") - the user who created this post
** TagsPosts(":START_ID(Post)", ":END_ID(Tag)") - one entry per tag
- Users("Id#id:ID(User)", "DisplayName#name", "Reputation#reputation:INT", "CreationDate#createdAt:datetime", "LastAccessDate#accessedAt:datetime", "WebsiteUrl#url", "Location#location", "Views#views:INT", "UpVotes#upvotes:INT", "DownVotes#downvotes:INT", "Age#age:INT", "AccountId#accountId:INT")
- Tags("TagName#name:ID(Tag)", "Count#count:INT", "WikiPostId#wikiPostId:INT")
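The specs above follow an `XmlAttribute#csvField:TYPE` convention: the part before the `#` is the XML attribute to read from the dump, the part after is the CSV header entry (property name plus optional type) that neo4j-import understands. A minimal Python sketch of that split; `parse_field_spec` is a hypothetical helper, not part of the converter:

```python
def parse_field_spec(spec):
    """Split a spec like "Id#id:ID(Post)" into the XML attribute name
    and the CSV header entry (property name plus optional type)."""
    xml_attr, sep, csv_header = spec.partition("#")
    if not sep:                  # no '#': the attribute name doubles as the header
        csv_header = xml_attr
    return xml_attr, csv_header

specs = ["Id#id:ID(Post)", "Title#title", "Score#score:INT",
         "CreationDate#createdAt:datetime"]
mapping = dict(parse_field_spec(s) for s in specs)
print(mapping["Id"])     # id:ID(Post)
print(mapping["Score"])  # score:INT
```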
The XML processing limits have to be raised, as the Stack Overflow dumps are larger than the JDK's default entity limits allow:

----
mvn compile exec:java -Dexec.mainClass=org.neo4j.example.so.StackOverflowConverter \
  -Dexec.args="/path/to/data" \
  -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0
----
This will create the following files in the data directory:

----
Posts.csv.gz          Posts_header.csv
PostsAnswers.csv.gz   PostsAnswers_header.csv
PostsRels.csv.gz      PostsRels_header.csv
Tags.csv.gz           Tags_header.csv
TagsPosts.csv.gz      TagsPosts_header.csv
Users.csv.gz          Users_header.csv
UsersPosts.csv.gz     UsersPosts_header.csv
----
=== Import
Import with:

----
export NEO=/path/to/neo
$NEO/bin/neo4j-import \
  --into stackoverflow.db \
  --id-type string \
  --bad-tolerance true --skip-bad-entries-logging true --skip-bad-relationships true --skip-duplicate-nodes true \
  --nodes:Post Posts_header.csv,Posts.csv.gz \
  --nodes:User Users_header.csv,Users.csv.gz \
  --nodes:Tag Tags_header.csv,Tags.csv.gz \
  --relationships:ANSWERED PostsRels_header.csv,PostsRels.csv.gz \
  --relationships:ACCEPTED PostsAnswers_header.csv,PostsAnswers.csv.gz \
  --relationships:TAGGED TagsPosts_header.csv,TagsPosts.csv.gz \
  --relationships:POSTED UsersPosts_header.csv,UsersPosts.csv.gz
----
Post-processing, e.g. with neo4j-shell or cypher-shell:

- set the `:Question` label on posts with `postType = 1`
- set the `:Answer` label on the accepted answers
----
call apoc.periodic.iterate("MATCH (p:Post) where p.postType = 1 return p", "SET p:Question",{batchSize:10000,iterateList:true,parallel:true});
call apoc.periodic.iterate("MATCH (q:Question)-[:ACCEPTED]->(a) RETURN a", "SET a:Answer",{batchSize:10000,iterateList:true,parallel:true});

create index on :Question(title);
create index on :Question(createdAt);
create index on :Question(score);
create index on :Question(views);
create index on :Question(answers);
create index on :Question(updatedAt);
create index on :Question(favorites);
create index on :Question(comments);
create index on :Tag(count);
create index on :User(name);
create index on :User(reputation);
create index on :User(createdAt);
create index on :User(views);
create index on :User(upvotes);
create index on :User(age);
create index on :User(accountId);
create index on :User(location);

create constraint on (t:Tag) assert t.name is unique;
create constraint on (u:User) assert u.id is unique;
create constraint on (p:Post) assert p.id is unique;
----
=== Internals
This script is run with individual files (e.g. Users.xml.gz) as parameters.
You can optionally provide a comma-separated list of keys to extract after a colon:
----
mvn compile exec:java -Dexec.mainClass=org.neo4j.example.so.XmlToCsvConverter \
  -Dexec.args="Users.xml.gz:Id,Reputation,CreationDate,DisplayName,LastAccessDate
Posts.xml.gz:Id,CreationDate,Score,OwnerUserId,Title,Tags,AnswerCount"
...
----
A quick sed script for extracting those keys from the first row is:
----
gzip -dc Users.xml.gz | head -3 | tail -1 | sed -e 's/="[^"]*" /,/g'
----
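The same extraction can be sketched in Python, assuming the standard Stack Exchange dump layout where each data row is a single `<row .../>` element whose fields are XML attributes:

```python
import re

# a typical data row from Users.xml (the third line of the file)
row = '<row Id="1" Reputation="101" CreationDate="2008-07-31T14:22:31.287" DisplayName="Jeff" />'

# pull out every attribute name, mirroring the sed one-liner above
keys = re.findall(r'(\w+)="[^"]*"', row)
print(",".join(keys))  # Id,Reputation,CreationDate,DisplayName
```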
If you don't provide the keys, it samples the first 100 rows to learn about the keys present and uses them.
The first few rows might have fewer columns than later ones, since new key-columns can still be discovered further into the file; unknown columns will be left empty.
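That sampling behaviour can be sketched like this (hypothetical `learn_keys` helper, again assuming the `<row .../>` attribute format; the actual converter is the Java class above):

```python
import itertools
import re

ATTR = re.compile(r'(\w+)="([^"]*)"')

def learn_keys(rows, sample=100):
    """Collect the union of attribute names seen in the first `sample` rows."""
    keys = []
    for row in itertools.islice(rows, sample):
        for key, _ in ATTR.findall(row):
            if key not in keys:
                keys.append(key)
    return keys

rows = ['<row Id="1" Name="a" />', '<row Id="2" Name="b" Age="30" />']
keys = learn_keys(rows)            # ['Id', 'Name', 'Age']

# a row seen before "Age" first appeared gets an empty value in that column
table = [[dict(ATTR.findall(r)).get(k, "") for k in keys] for r in rows]
print(table)  # [['1', 'a', ''], ['2', 'b', '30']]
```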
Per input file it will output a compressed CSV file with the data, e.g. Users.csv.gz, and a header file, e.g. Users_header.csv.
----
mvn compile exec:java -Dexec.mainClass=org.neo4j.example.so.XmlToCsvConverter -Dexec.args="Users.xml.gz:Id,Reputation,CreationDate,DisplayName,LastAccessDate Posts.xml.gz:Id,CreationDate,Score,OwnerUserId,Title,Tags,AnswerCount"

[INFO] ------------------------------------------------------------------------
[INFO] Building stackoverflow-import 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
Processing Users.xml.gz
Processing Posts.xml.gz
Done processing Users.xml.gz with 99990 rows in 2 seconds.
Done processing Posts.xml.gz with 152803 rows in 4 seconds.
----