graphipedia Performance notes are wrong

Hello,

If it took 30 minutes to process a 9.1 GB file, the throughput was about 5.06 MB/s (9.1 GB ≈ 9100 MB; 9100 MB / (30 * 60 s) ≈ 5.06 MB/s). 5400 rpm disks have about 40 MB/s read/write throughput, so they are not the bottleneck. To speed things up you can use lbzip2, which is multi-threaded (it helped me a lot).
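As a quick check of the arithmetic above, here is a minimal Python sketch. The function name `throughput_mb_per_s` and the `mb_per_gb` parameter are made up for illustration; the only inputs taken from the post are the 9.1 GB size and the 30 minute duration. It computes the figure with both the decimal (1000 MB) and binary (1024 MB) reading of a gigabyte, since the two differ slightly.

```python
# Throughput check for the first (link extraction) step.
# Figures from the post: 9.1 GB of input processed in 30 minutes.
SIZE_GB = 9.1
DURATION_S = 30 * 60


def throughput_mb_per_s(size_gb, duration_s, mb_per_gb=1000):
    """Average throughput in MB/s for size_gb processed in duration_s seconds."""
    return size_gb * mb_per_gb / duration_s


# Hypothetical helper names; only the two input figures come from the thread.
decimal_rate = throughput_mb_per_s(SIZE_GB, DURATION_S)        # 1 GB = 1000 MB
binary_rate = throughput_mb_per_s(SIZE_GB, DURATION_S, 1024)   # 1 GB = 1024 MB

print(f"{decimal_rate:.2f} MB/s (decimal), {binary_rate:.2f} MB/s (binary)")
```

Either way the result is well below the roughly 40 MB/s quoted for the disk.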

Best regards

Jun 19 '15 14:06 slonka

Well, when creating a db, Neo4j is not simply writing data sequentially to the disk, so I wouldn't expect it to reach the max throughput. In my tests the disk made a huge difference, so I called it the "critical factor" (not "bottleneck"). But thanks for suggesting lbzip2, I will add it to the README.

Jun 19 '15 22:06 mirkonasato

I only described what I thought was wrong with the description of the first step of the importing process, which is read -> regexp -> write (creating an intermediate XML file). The second part is still running (7 hours so far, and it has only imported 70M links). I have no idea how you managed to do it in only 10 minutes.

I've run jvisualvm, iotop and htop, and discovered that at the beginning the process is mostly running read/write operations (org.neo4j.io.fs.StoreFileChannel.write / read). It creates 50K links every 3 seconds, and at that pace the whole thing would take 1 hour and 40 minutes. After a while it starts to run more MuninnPageCache operations (flushAtIORatio, parkUntilEvictionRequired) and slows down significantly.

In the first part of the operation the CPU usage was maxed out (95-100% on 4 cores) and the read and write throughput were 10 MB/s and 5 MB/s respectively. Now, in the second part, the CPU usage is really low (around 10%) and the write throughput is around 10 MB/s.
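The pace estimate can be reproduced with a short Python sketch. Note the assumption: the thread does not state the total number of links, so `TOTAL_LINKS = 100_000_000` is back-derived from the "1 hour and 40 minutes" figure and is only illustrative. The function name is likewise hypothetical. The second half compares the initial pace with the average pace implied by "70M links in 7 hours".

```python
def estimated_duration_s(total_links, links_per_batch, seconds_per_batch):
    """Seconds needed to import total_links at a fixed batch pace."""
    return total_links * seconds_per_batch / links_per_batch


# ASSUMPTION: total link count is not given in the thread; 100M is the value
# implied by the quoted 1 h 40 min estimate at 50K links per 3 seconds.
TOTAL_LINKS = 100_000_000

eta_s = estimated_duration_s(TOTAL_LINKS, 50_000, 3)
hours, remainder = divmod(int(eta_s), 3600)
print(f"estimated: {hours} h {remainder // 60} min")

# Pace at the start vs. the average pace actually observed so far.
initial_rate = 50_000 / 3                  # links per second at the beginning
observed_rate = 70_000_000 / (7 * 3600)    # 70M links after 7 hours
print(f"initial: {initial_rate:.0f} links/s, observed: {observed_rate:.0f} links/s")
print(f"slowdown factor: {initial_rate / observed_rate:.1f}")
```

Under these figures the average pace over the 7 hours is several times lower than the initial one, which matches the slowdown described above.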

iostat shows that the CPU is mostly waiting on I/O or idle:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5,10    0,00    1,58   43,36    0,00   49,96
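To put a number on "mostly waiting on I/O or idle", here is a small Python sketch that parses the avg-cpu figures quoted above. The helper `parse_avg_cpu` is hypothetical, not part of iostat or any library; it assumes whitespace-separated columns and handles the decimal commas that this locale prints.

```python
# Column names and values copied from the iostat output quoted in the post.
HEADER = "%user %nice %system %iowait %steal %idle"
VALUES = "5,10 0,00 1,58 43,36 0,00 49,96"


def parse_avg_cpu(header, values):
    """Map iostat avg-cpu column names (without '%') to float percentages."""
    names = [name.lstrip("%") for name in header.split()]
    # The locale in the post uses a decimal comma, so normalise it first.
    numbers = [float(value.replace(",", ".")) for value in values.split()]
    return dict(zip(names, numbers))


cpu = parse_avg_cpu(HEADER, VALUES)
waiting_or_idle = cpu["iowait"] + cpu["idle"]
print(f"iowait + idle = {waiting_or_idle:.2f}%")
```

With the quoted sample, the iowait and idle columns together account for the large majority of CPU time, while user time is only a few percent.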

I think there is something wrong with the caching mechanism in Neo4j. What do you think?

Jun 20 '15 08:06 slonka

Fair point, I didn't realise you were talking about the first step only.

Yes, the second part, i.e. creating the graph db, is where having an SSD really helps. I haven't investigated much, but I guess it must be doing a lot of random access operations.

Jun 21 '15 22:06 mirkonasato
