Performance issue when upserting 10k+ edges using MERGE
Hi, I am trying to insert/update around 70K edges into the graph using age. I created the following function:
CREATE OR REPLACE FUNCTION create_update_edge(edge_list TEXT)
RETURNS void AS $$
DECLARE
query TEXT;
BEGIN
query := format($q$
SELECT * FROM ag_catalog.cypher(
'test_attribute_instance_graph',
$cypher$
UNWIND %s AS edge
MERGE (a:attribute_instance {name: edge.source_name, value: edge.source_value})-[r:cooccurs_with]-(b:attribute_instance {name: edge.target_name, value: edge.target_value})
SET r.count = edge.count
SET r.first_seen = edge.first_seen
SET r.last_seen = edge.last_seen
$cypher$
) AS (ignored agtype);
$q$, edge_list);
EXECUTE query;
RAISE NOTICE 'Edges merged and properties set.';
END;
$$ LANGUAGE plpgsql;
These are some of the points I had to take care of :
- Function signature of ag_catalog.cyphe is (name, cstring, agtype)) So, I was not directly able to pass JSONB format as a parameter, and had to change the list of edges(each data is mapped to a dict) to a Cypher-compatible map(keys are unquoted) before passing it to the above function.
- Trying to use MERGE..ON CREATE/ON MATCH SET gave syntax error, so I simply used MERGE SET to do the upsertion of edges.
- Trying to B-tree indexing was a failed attempt, as again I was coming across a syntax error using the standard CREATE INDEX for B-tree command, so I used GIN Index.
Under this setup and batch unwind (batch size 1000 edges): I got the following results
| Edges Inserted | Time (s) | Time (min) |
|---|---|---|
| 1,000 | 0.62 | 0.01 |
| 2,000 | 16.35 | 0.27 |
| 3,000 | 22.42 | 0.37 |
| 4,000 | 69.63 | 1.16 |
| 5,000 | 104.75 | 1.75 |
| 6,000 | 177.12 | 2.95 |
| 7,000 | 447.66 | 7.46 |
| 8,000 | 598.38 | 9.97 |
| 9,000 | 913.82 | 15.23 |
| 10,000 | 1320.22 | 22.00 |
Clearly, the performance is pretty bad, so I was wondering if this is a limitation of age or a wrong implementation on my end. If someone could give me insights on my method and any other ways I should try inserting the data/or optimise the current method, it'd be of great help! Thankyou!!
@Yashmitha1 One way you could improve this would be to use a directed edge (-[]-> or <-[]-). Currently, you are using a bi-directional edge (-[]-), which are notoriously slow. Maybe you could try 2 queries, one for each direction?
@Yashmitha1 Also could you share the execution plan?
Hi, doing something similar to this, we are using directed edges, and trying to create >8 million edges, and noticed this similar trend in performance, where after the approx 200k insertions, the performance starts to degrade very badly. We also tried to skip to that offset 200k and start our insertions from that offset, but the performance is still very poor. Is there any advice on this?
@ysvoon Could you provide your query and execution plan?
Hi @jrgemignani, I've created a separate issue with the sample query and what we've tried and the query plan here https://github.com/apache/age/issues/2198