File Size Limit for .graphml import?

Open MikeB2019x opened this issue 2 years ago • 8 comments

Expected Behavior

I have been using the following command to import .graphml files into Neo4j:

CALL apoc.import.graphml("xxx.graphml", {readLabels: true, storeNodeIds:true})

This has worked in the past with .graphml files up to 1 GB in size.

Actual Behavior

I've recently had to work with larger .graphml files. An import of a 3 GB file completed without error, but while all the nodes were imported, only half the edges were. No error or warning was raised.

Note that the .graphml is an xml document that begins with meta info, followed by node info, followed by edge info. Since the import stops midway through the edge info I'm wondering if there is a setting/limit on number of lines or size of the .graphml file?

How to Reproduce the Problem

  1. Create a large graph in networkx. (4.6M nodes w/20 attributes (float) each, 4.8M edges)
  2. Export as a .graphml file
  3. Import the .graphml file into Neo4j using the command above.

Versions

  • OS: Mac Pro M1 w/ Ventura 13.3.1 (a)
  • Neo4j: 4.4.0 (community)
  • Neo4j-Apoc: 4.4.0.1

MikeB2019x avatar Jun 28 '23 13:06 MikeB2019x

Hi! I tried this out and did run into issues with a larger file, although in my case I could see an OOM in my logs (have you checked the logs? Perhaps you have this too). The fix for me was to adjust the batchSize in the config:

CALL apoc.import.graphml("xxx.graphml", {readLabels: true, storeNodeIds:true, batchSize: 100})

I am unsure of the optimum number here, but the default is 20,000, so I imagine 100 was a bit extreme on the low side 😅

I'll also ticket this to see if we can make either performance improvements or at least throw an exception instead of crashing the query!

Let me know if this helps :)

gem-neo4j avatar Jun 29 '23 07:06 gem-neo4j

Thank you for the 'batch size' tip, it will be useful b/c the next batch of files will be larger.

Note, my situation is slightly different as there is no error, it's just that half the edges are ignored/not imported. For example, the graphml contains 4M nodes, 4M edges but after an error free upload the neo4j db shows 4M nodes and 2M edges. I looked to see if the graphml had duplicates of edges but that is not the case.

MikeB2019x avatar Jun 29 '23 12:06 MikeB2019x

The logs are in debug.log :)

The file is imported in batches of transactions, so if all the edges are last in the file, then it potentially crashes before it hits them, but the transaction has already committed the nodes, which might explain the discrepancy.
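
To make the mechanism concrete, here is a toy Python simulation. It is not APOC's actual code: the function `import_in_batches`, the `fail_at` parameter and the in-memory `committed` list are all hypothetical, and the only assumption carried over from the thread is that items are committed batch by batch in file order (nodes first, then edges).

```python
def import_in_batches(items, batch_size, fail_at=None):
    """Commit items batch by batch; stop at the batch containing index fail_at."""
    committed = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        if fail_at is not None and start <= fail_at < start + len(batch):
            # Simulated crash: this batch and everything after it is lost,
            # but batches committed earlier stay in the store.
            break
        committed.extend(batch)
    return committed


# Nodes come first in the file, edges last (as in a GraphML export).
items = [("node", i) for i in range(4)] + [("edge", i) for i in range(4)]
result = import_in_batches(items, batch_size=2, fail_at=6)

nodes = sum(1 for kind, _ in result if kind == "node")
edges = sum(1 for kind, _ in result if kind == "edge")
print(nodes, edges)  # all 4 nodes, but only 2 of the 4 edges
```

In this toy run the failure hits the last batch, so every node survives while half the edges are missing, which is the kind of discrepancy a mid-import crash could produce.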

gem-neo4j avatar Jun 29 '23 14:06 gem-neo4j

Yeah, found them =D This is the log from executing the import, with no batch, into an empty db. Nothing indicates an error to me, or am I missing something?

2023-06-30 03:16:54.710+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=1241, gcTime=1272, gcCount=2}
2023-06-30 03:17:01.769+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.228] version=227, last transaction in previous log=25471, rotation took 50 millis.
2023-06-30 03:17:12.914+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/bcb61400] Discarded stale query from the query cache after 861 seconds. Reason: NodesAllCardinality changed from 10.0 to 599999.0, which is a divergence of 0.9999833333055556 which is greater than threshold 0.614853273406578. Query id: 83
2023-06-30 03:17:18.541+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.229] version=228, last transaction in previous log=25474, rotation took 67 millis, started after 16705 millis.
2023-06-30 03:17:23.979+0000 WARN  [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=323, gcTime=404, gcCount=1}
2023-06-30 03:17:36.260+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.230] version=229, last transaction in previous log=25477, rotation took 67 millis, started after 17651 millis.
2023-06-30 03:17:53.573+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.231] version=230, last transaction in previous log=25480, rotation took 68 millis, started after 17246 millis.
2023-06-30 03:18:11.850+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.232] version=231, last transaction in previous log=25483, rotation took 85 millis, started after 18192 millis.
2023-06-30 03:18:12.934+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/bcb61400] Discarded stale query from the query cache after 59 seconds. Reason: NodesAllCardinality changed from 599999.0 to 2599999.0, which is a divergence of 0.7692310650888712 which is greater than threshold 0.7404586799070909. Query id: 92
2023-06-30 03:18:33.509+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.233] version=232, last transaction in previous log=25486, rotation took 118 millis, started after 21541 millis.
2023-06-30 03:18:56.764+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.234] version=233, last transaction in previous log=25489, rotation took 128 millis, started after 23127 millis.
2023-06-30 03:19:21.378+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.235] version=234, last transaction in previous log=25492, rotation took 104 millis, started after 24509 millis.
2023-06-30 03:19:51.403+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.236] version=235, last transaction in previous log=25497, rotation took 179 millis, started after 29846 millis.
2023-06-30 03:19:53.004+0000 INFO  [o.n.c.i.ExecutionEngine] [neo4j/bcb61400] Discarded stale query from the query cache after 99 seconds. Reason: CardinalityByLabelsAndRelationshipType(None,None,None) changed from 1.0 to 722148.0, which is a divergence of 0.9999986152423049 which is greater than threshold 0.7329806490780797. Query id: 107
2023-06-30 03:20:16.987+0000 INFO  [o.n.k.d.Database] [neo4j/bcb61400] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.237] version=236, last transaction in previous log=25503, rotation took 270 millis, started after 25315 millis.

MikeB2019x avatar Jun 30 '23 03:06 MikeB2019x

Hmm okay, how does the query log look for it? Also, did it work when you tried batchSize? I can't reproduce a case where it just misses the relationships 🙈

gem-neo4j avatar Jun 30 '23 11:06 gem-neo4j

Thank you for the replies! Yes, the process works when using batchSize, but I get the same result, i.e. half the edges and no error. Note that I have confirmed the .graphml file is correct: if I open it in networkx, all nodes and edges are present.
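
For a file too large to load comfortably, the node and edge counts can also be checked by streaming through the XML. The following is a minimal sketch using only the Python standard library; the function name `count_graphml_elements` and the tiny `SAMPLE` document are made up for illustration, and the sketch assumes the file uses plain `<node>` and `<edge>` elements, with or without the GraphML namespace.

```python
import io
import xml.etree.ElementTree as ET


def count_graphml_elements(source):
    """Stream a GraphML file (path or binary file object) and count nodes and edges."""
    counts = {"node": 0, "edge": 0}
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop the "{namespace}" prefix, if present, to get the bare tag name.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag in counts:
            counts[tag] += 1
            # Discard this element's attributes and children (e.g. <data>)
            # once counted; the emptied element stays attached to its parent.
            elem.clear()
    return counts


# Tiny stand-in document; a real call would pass the path to the .graphml file.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="node_invoice__204779"/>
    <node id="node_payments__200180"/>
    <edge source="node_invoice__204779" target="node_payments__200180"/>
  </graph>
</graphml>
"""

print(count_graphml_elements(io.BytesIO(SAMPLE)))
```

The resulting counts can then be compared with what the database reports after the import.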

MikeB2019x avatar Jun 30 '23 12:06 MikeB2019x

Okay, here's what happened. This is an example of the edges as represented in the .graphml:

    <edge source="node_invoice__204779" target="node_payments__200180" />

Notice there is no label. The edge list that was used does not have any relationship name; it just specifies the two nodes. After import, this is what was appearing in the browser:

[image]

The number is exactly half of the number of edges. What looks to have happened is that a label has been added during or after import, but only to a subset of the edges. If I click on one of the edges I get:

[image]

I had noticed the 'related' tag but didn't think it through, as I assumed it had been applied to all the edges. So:

MATCH ()-[e]->() RETURN count(e)

returns 2323193, which matched what I was seeing. But when I used:

MATCH ()-[e]-() RETURN count(e)
MATCH ()-[e:RELATED]-() RETURN count(e)

both return 4646386. Why would there be a difference between these queries? I expected them all to return the same value. And even if there was a difference, I would have expected the UI to show the result of the last two queries. Thoughts?

MikeB2019x avatar Jun 30 '23 16:06 MikeB2019x

The "RELATED" type is added as every relationship must have one type, and if none is specified APOC adds that generic one.

The reason those 2 queries return double the amount is that they return every relationship twice. A pattern with no direction matches each relationship once from each end, i.e. as (a)-->(b) and also as (b)<--(a). If you only want one of each, you need to add a direction :)
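
The doubling can be mimicked outside the database. This is a small Python sketch, not how Neo4j evaluates patterns internally: the three-edge list is invented for illustration and is assumed to contain no self-loops. Each row stands for one match of the pattern, written as (start node, end node).

```python
# Hypothetical directed relationships, stored once each as (source, target).
edges = [("a", "b"), ("b", "c"), ("c", "a")]

# ()-[e]->(): each relationship matches once, read from source to target.
directed_rows = [(s, t) for s, t in edges]

# ()-[e]-(): each relationship matches twice, once starting from each end.
undirected_rows = [(s, t) for s, t in edges] + [(t, s) for s, t in edges]

print(len(directed_rows), len(undirected_rows))
```

The same factor of two links the numbers in this thread: 2 × 2323193 = 4646386. In Cypher, an undirected pattern can also be counted once per relationship with count(DISTINCT e).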

gem-neo4j avatar Jul 03 '23 12:07 gem-neo4j

Closing, as there hasn't been a reply in a while and, on re-reading the exchange, it may have been a misunderstanding about how to count relationships.

gem-neo4j avatar Mar 25 '25 08:03 gem-neo4j