
Performance Issue in Edge Creation for Large Tables and Index Creation

Open ysvoon opened this issue 6 months ago • 19 comments

Questions:

  1. How can I create indexes for graph tables?
  2. How effective are they? I am using a GIN index on properties right now, and I can see that performance does improve for the first 200k edge creations, but it starts to degrade severely after that. Would creating an index specifically on the key I am using in the MATCH clause of my edge-creation query help?
  3. I have also tried skipping to the 200k offset and starting my edge creations from that point, but performance is still poor for the rows after the 200k mark onwards. Does this mean there is a limitation to indexes? How can I insert 8 million edges into my table within a reasonable time period?

What we have tried:

Question 1

I have referred to issue https://github.com/apache/age/issues/2137 to create btree and GIN indexes; so far only this works for me:

-- worked, and the query plan shows that we are using the index, but it is slow after >200k edge creations
CREATE INDEX index_name ON schema_name."table_name" USING gin (properties);

-- worked, but the query plan shows that we are using a Seq Scan
CREATE INDEX IF NOT EXISTS index_name ON schema_name."table_name" USING gin ((properties -> '"tableId"'));

-- worked, but the query plan shows that we are using a Seq Scan
CREATE INDEX index_name ON schema_name."table_name" USING btree ((properties->>'"tableId"'));

and the following return errors:

-- trying to create a GIN index on a more specific key within properties that I am using in the MATCH clause of the edge-creation query
-- got error -> ERROR:  operator does not exist: ag_catalog.agtype -> unknown

CREATE INDEX IF NOT EXISTS index_name
ON schema_name."table_name" USING gin ((properties -> 'some_id'));


-- trying to create a btree index in general
-- got error -> ERROR:  type "agtype" does not exist

CREATE INDEX idx_btree_name
ON test."NodeType1"
USING btree (agtype_access_operator(VARIADIC ARRAY[properties, '"name"'::agtype]));


-- got server crash/disconnections in pgAdmin

CREATE INDEX IF NOT EXISTS idx_user_id ON goodreads_graph."User" USING BTREE (id);
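
One possible cause of the two errors above, offered as an assumption rather than a confirmed diagnosis: the `agtype` type and its operators live in the `ag_catalog` schema, so a session that does not have `ag_catalog` on its `search_path` cannot resolve the unqualified names. A minimal sketch:

```sql
-- Assumption: the "type agtype does not exist" and
-- "operator does not exist: ag_catalog.agtype -> unknown" errors
-- stem from ag_catalog being absent from the session's search_path.
SET search_path = ag_catalog, "$user", public;
-- With the search_path set, the agtype-based index definitions above
-- should at least parse; whether the planner uses them is a separate question.
```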


Question 2 & 3

Our query looks something like this:

SELECT * FROM cypher('schema_name', $$
                    WITH {cypher_array} AS rows
                    UNWIND rows AS row
                    MATCH (a:Person {PersonId: row.StartId}),
                          (b:Person {PersonId: row.EndId})
                    CREATE (a)-[e:{edge_name} {{
                        StartId: row.StartId,
                        EndId: row.EndId,
                        otherprops: row.other_props
                    }}]->(b)
                $$) AS (e agtype);

and we are doing one insertion for every 100 rows, starting from a 100k-row offset into the original graph table of 8 million rows. Initially each batch of 100 rows would take 2-3 s, but starting from roughly the 200k offset, the time would increase to 100-1000 s per 100 rows.
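
The client-side batching described above (one statement per 100 rows) can be sketched with a hypothetical helper; this is illustrative only, not the poster's actual code:

```python
def batched(rows, size=100):
    """Yield successive fixed-size batches from a list of rows.

    Each batch would be formatted into one UNWIND-based cypher() call,
    so the database sees one statement per `size` rows instead of one
    statement per row.
    """
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```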

The query plan shows that it is using the index I've created (Bitmap Index Scan on idx_something...), and I am at a loss as to what else I can try to improve the performance of the edge-creation queries.

If you could help us understand this, it would be a great help, thank you!

ysvoon avatar Aug 04 '25 09:08 ysvoon

@ysvoon Could you provide the query plan? Also, are you increasing the size of Person? Meaning, adding in more people?

For the query plan, use explain analyze -

SELECT * FROM cypher('schema_name', $$ EXPLAIN ANALYZE 
                    WITH {cypher_array} AS rows
                    UNWIND rows AS row
                    MATCH (a:Person {PersonId: row.StartId}),
                          (b:Person {PersonId: row.EndId})
                    CREATE (a)-[e:{edge_name} {{
                        StartId: row.StartId,
                        EndId: row.EndId,
                        otherprops: row.other_props
                    }}]->(b)
                $$) AS (e agtype);

jrgemignani avatar Aug 04 '25 16:08 jrgemignani

@ysvoon What I can say about your query above is that it is not scalable for larger datasets; let me see if I can explain why...

Basically, what you are asking, in pseudo code, is the following -

for row in cypher_array
{
    for a in Person
    {
        for b in Person
        {
             if (a.StartId == row.StartId && b.EndId == row.EndId)
             {
                 create ...
             }
         }
    }
}

The problem with this query is that it amounts to a function with a runtime of O(m*n^2). As n (the size of Person) gets larger, the amount of processing is squared and then multiplied by m, the number of rows in cypher_array. While this is fine for smaller datasets, it isn't for larger ones.
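
The scaling difference can be sketched with a small, self-contained Python example (illustrative only; `persons` and `rows` are stand-ins for the Person label table and cypher_array):

```python
def edges_nested_loop(persons, rows):
    """Mirror the pseudocode above: scan every (a, b) pair of persons
    for every input row -- O(m * n^2)."""
    created = []
    for row in rows:
        for a in persons:
            for b in persons:
                if (a["PersonId"] == row["StartId"]
                        and b["PersonId"] == row["EndId"]):
                    created.append((a["PersonId"], b["PersonId"]))
    return created


def edges_hashed(persons, rows):
    """Build a one-time PersonId -> vertex map, then do O(1) lookups
    per row -- O(m + n) overall, the effect an index is meant to give."""
    by_id = {p["PersonId"]: p for p in persons}
    created = []
    for row in rows:
        a = by_id.get(row["StartId"])
        b = by_id.get(row["EndId"])
        if a is not None and b is not None:
            created.append((a["PersonId"], b["PersonId"]))
    return created
```

Both functions produce the same edges; only the amount of scanning per row differs, which is why the slowdown grows with the size of Person rather than with the batch size.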

Edit: I'm not sure if indexes can help much here, but I'm not an expert.

jrgemignani avatar Aug 04 '25 16:08 jrgemignani

@ysvoon Could you provide the query plan? Also, are you increasing the size of Person? Meaning, adding in more people?

For the query plan, use explain analyze -

SELECT * FROM cypher('schema_name', $$ EXPLAIN ANALYZE 
                    WITH {cypher_array} AS rows
                    UNWIND rows AS row
                    MATCH (a:Person {PersonId: row.StartId}),
                          (b:Person {PersonId: row.EndId})
                    CREATE (a)-[e:{edge_name} {{
                        StartId: row.StartId,
                        EndId: row.EndId,
                        otherprops: row.other_props
                    }}]->(b)
                $$) AS (e agtype);

The size of the Person table will likely stay within the range of millions (around 8M nodes) and will gradually increase over time. For more context, it is usually the first-time full load that needs to create this many edges from the Person table, which will create around 8M edges, and we have two other edge tables of similar size to be created. Ideally, we would want all the edges to be created within a maximum of 2 days, but right now it is looking to be on the order of months.

For such a query,

EXPLAIN ANALYZE SELECT * FROM cypher('schema_name', $$
	MATCH (a: Person {PersonId: '<redacted>'}), (b: Person {PersonId: '<redacted>'})
	CREATE (a)-[e:edge_name{StartId: '<redacted>', EndId: '<redacted>'}]->(b)
	RETURN e $$)
	as (e agtype);

The query plan is as follows:

"Custom Scan (Cypher Create)  (cost=0.00..0.00 rows=0 width=32) (actual time=1161.431..3478.889 rows=2 loops=1)"
"  ->  Subquery Scan on _age_default_alias_previous_cypher_clause  (cost=261.17..3383466.14 rows=73822464 width=32) (actual time=0.243..0.270 rows=2 loops=1)"
"        ->  Nested Loop  (cost=261.17..2645241.50 rows=73822464 width=192) (actual time=0.243..0.267 rows=2 loops=1)"
"              ->  Bitmap Heap Scan on ""Person"" a  (cost=130.58..30716.89 rows=8592 width=376) (actual time=0.139..0.140 rows=1 loops=1)"
"                    Recheck Cond: (properties @> '{""PersonId"": ""<redacted>""}'::agtype)"
"                    Heap Blocks: exact=1"
"                    ->  Bitmap Index Scan on gin_idx_person_id  (cost=0.00..128.44 rows=8592 width=0) (actual time=0.129..0.129 rows=1 loops=1)"
"                          Index Cond: (properties @> '{""PersonId"": ""<redacted>""}'::agtype)"
"              ->  Materialize  (cost=130.58..30759.85 rows=8592 width=376) (actual time=0.079..0.092 rows=2 loops=1)"
"                    ->  Bitmap Heap Scan on ""Person"" b  (cost=130.58..30716.89 rows=8592 width=376) (actual time=0.076..0.086 rows=2 loops=1)"
"                          Recheck Cond: (properties @> '{""PersonId"": ""<redacted>""}'::agtype)"
"                          Heap Blocks: exact=2"
"                          ->  Bitmap Index Scan on gin_idx_person_id  (cost=0.00..128.44 rows=8592 width=0) (actual time=0.073..0.073 rows=2 loops=1)"
"                                Index Cond: (properties @> '{""PersonId"": ""<redacted>""}'::agtype)"
"Planning Time: 0.373 ms"
"Execution Time: 3478.965 ms"

I understand we have two Bitmap Heap Scans on the Person table because we are using two MATCH clauses to match the start_id and end_id for the edge creation. It seems that when I use a GIN index on just properties, the query plan shows that the index is used, but how efficiently it is used I am not sure.

Edit: when we were using the GIN index on properties ('"PersonId"') or the btree index on properties ('"PersonId"'), it does a Seq Scan instead of a Bitmap Index Scan. Could it be that the index was created on "PersonId" and not PersonId? However, I could not get the index to be created if I changed the quotes from what I have provided above (I get syntax or column-not-found errors).

ysvoon avatar Aug 05 '25 02:08 ysvoon

@ysvoon What I can say about your query above is that it is not scalable for larger datasets; let me see if I can explain why...

Basically, what you are asking, in pseudo code, is the following -

for row in cypher_array
{
    for a in Person
    {
        for b in Person
        {
             if (a.StartId == row.StartId && b.EndId == row.EndId)
             {
                 create ...
             }
         }
    }
}

The problem with this query is that it amounts to a function with a runtime of O(m*n^2). As n (the size of Person) gets larger, the amount of processing is squared and then multiplied by m, the number of rows in cypher_array. While this is fine for smaller datasets, it isn't for larger ones.

Edit: I'm not sure if indexes can help much here, but I'm not an expert.

Yes, I do understand that this is most likely the case. However, when we tried the query without cypher_array and/or UNWIND, executing either single CREATE statements or batched CREATE statements (of size 100) in one go, we observed that the performance was definitely slower, even with indexing.

Vertex creation was fairly fast; we only observe this performance issue with the edge-creation process.

How would you recommend we modify our query to improve the execution time for such large datasets, specifically for edge creation? (Note: we cannot use the CSV file load method.)

Are there any functions or best practices that others have used to insert edge datasets this large? And if Apache AGE is not a suitable platform for our use case, please let us know as well.

Thank you for any advice you have for us. :)

ysvoon avatar Aug 05 '25 02:08 ysvoon

Vertex creation was fairly fast; we only observe this performance issue with the edge-creation process.

This is due to the two MATCH clauses used in creating the edges, as you are building your edges from the Person table; this is an O(n^2) operation.

You might want to try the MERGE command.

Without understanding your use case or dataset, it is hard to suggest alternatives.

jrgemignani avatar Aug 05 '25 17:08 jrgemignani

This is due to the two MATCH clauses used in creating the edges, as you are building your edges from the Person table; this is an O(n^2) operation.

You might want to try the MERGE command.

I see, I've tried the MERGE command today and there are a few issues I have observed:

  1. The performance is alright, but again it starts to slow down after >200k entries. In the beginning it took just a few seconds to create 100 edges, but after the 200k mark it takes >1 minute to create 100 edges. Could it be that after a certain point indexing won't help, and that MERGE, which still does a Seq Scan on the edges, is why the performance worsens over time?

  2. If I use the MERGE command like this, it creates duplicate vertices, so I don't think this is the correct way.

    SELECT * FROM cypher('schema_name', $$
    MERGE (a: Person {PersonId: '<redacted>'})-[e:edge_name]->(b: Person {PersonId: '<redacted>'})
    SET e.StartId = '<redacted>', e.EndId = '<redacted>'
    RETURN e $$)
    as (e agtype);
    
  3. So I tried the MERGE command as recommended in https://github.com/apache/age/issues/1517, but it doesn't create the properties (they come back empty), and I had to run the same query a second time for the properties to show up, the same as issue https://github.com/apache/age/issues/1907. Although that issue says it has been resolved, I am still seeing it.

    SELECT * FROM cypher('schema_name', $$
    MERGE (a: Person {PersonId: '<redacted>'})
    MERGE (b: Person {PersonId: '<redacted>'})
    MERGE (a: Person {PersonId: '<redacted>'})-[e:edge_name]->(b: Person {PersonId: '<redacted>'})
     SET e.StartId = '<redacted>', e.EndId = '<redacted>'
     RETURN e $$)
     as (e agtype);
    

Query Plan:

"Custom Scan (Cypher Set)  (cost=0.00..0.00 rows=0 width=32) (actual time=822.972..822.984 rows=1 loops=1)"
"  ->  Subquery Scan on _age_default_alias_previous_cypher_clause  (cost=0.00..0.00 rows=1 width=32) (actual time=822.903..822.915 rows=1 loops=1)"
"        ->  Custom Scan (Cypher Merge)  (cost=0.00..0.00 rows=0 width=448) (actual time=822.901..822.911 rows=1 loops=1)"
"              ->  Subquery Scan on _age_default_alias_previous_cypher_clause_1  (cost=0.00..1475455.11 rows=73770921 width=32) (actual time=0.180..0.190 rows=1 loops=1)"
"                    ->  Hash Right Join  (cost=0.00..737745.90 rows=73770921 width=192) (actual time=0.180..0.188 rows=1 loops=1)"
"                          Hash Cond: ((e.start_id = (age_id(_age_default_alias_previous_cypher_clause_2.a))::graphid) AND (e.end_id = (age_id(_age_default_alias_previous_cypher_clause_2.b))::graphid))"
"                          Join Filter: ((age_properties(_age_default_alias_previous_cypher_clause_2.a) @> '{""PersonId"": ""<redacted>""}'::agtype) AND (age_properties(_age_default_alias_previous_cypher_clause_2.b) @> '{""PersonId"": ""<redacted""}'::agtype))"
"                          ->  Seq Scan on ""edge_name"" e  (cost=0.00..24.55 rows=970 width=48) (actual time=0.002..0.002 rows=0 loops=1)"
"                          ->  Hash  (cost=0.00..0.00 rows=0 width=64) (actual time=0.168..0.172 rows=1 loops=1)"
"                                Buckets: 1024  Batches: 1  Memory Usage: 9kB"
"                                ->  Custom Scan (Cypher Merge)  (cost=0.00..0.00 rows=0 width=64) (actual time=0.158..0.162 rows=1 loops=1)"
"                                      ->  Subquery Scan on _age_default_alias_previous_cypher_clause_2  (cost=130.57..1321589.78 rows=73770921 width=64) (actual time=0.158..0.162 rows=1 loops=1)"
"                                            ->  Nested Loop Left Join  (cost=130.57..583880.57 rows=73770921 width=128) (actual time=0.157..0.161 rows=1 loops=1)"
"                                                  ->  Custom Scan (Cypher Merge)  (cost=0.00..0.00 rows=0 width=32) (actual time=0.071..0.072 rows=1 loops=1)"
"                                                        ->  Subquery Scan on _age_default_alias_previous_cypher_clause_3  (cost=130.57..30641.61 rows=8589 width=32) (actual time=0.070..0.072 rows=1 loops=1)"
"                                                              ->  Bitmap Heap Scan on ""Person"" a  (cost=130.57..30555.72 rows=8589 width=64) (actual time=0.070..0.071 rows=1 loops=1)"
"                                                                    Recheck Cond: (properties @> '{""PersonId"": ""<redacted>""}'::agtype)"
"                                                                    Heap Blocks: exact=1"
"                                                                    ->  Bitmap Index Scan on index_name  (cost=0.00..128.42 rows=8589 width=0) (actual time=0.061..0.061 rows=1 loops=1)"
"                                                                          Index Cond: (properties @> '{""PersonId"": ""<redacted>""}'::agtype)"
"                                                  ->  Bitmap Heap Scan on ""Person"" b  (cost=130.57..30512.77 rows=8589 width=32) (actual time=0.083..0.084 rows=1 loops=1)"
"                                                        Recheck Cond: (properties @> '{""PersonId"": ""<redacted>""}'::agtype)"
"                                                        Heap Blocks: exact=1"
"                                                        ->  Bitmap Index Scan on index_name (cost=0.00..128.42 rows=8589 width=0) (actual time=0.079..0.080 rows=1 loops=1)"
"                                                              Index Cond: (properties @> '{""PersonId"": ""<redacted>""}'::agtype)"
"Planning Time: 0.390 ms"
"Execution Time: 823.126 ms"

Without understanding your use case or dataset, it is hard to suggest alternatives.

We have 3 vertex graph tables, each currently around 8M nodes in volume, and we need to create edges to represent the relationships between them. Currently there are 3 types of edges that need to be created, and each edge graph table will hold around 8M edges. With 3 such edge graph tables, more than 24M edges need to be created in total.

Currently, it still seems too slow with the MERGE clause, and I am unsure if this is normal. Are there any other best practices we can apply to speed up this process?

ysvoon avatar Aug 06 '25 14:08 ysvoon

@ysvoon The set/merge issue #1907 was resolved but has not been released and is not part of 1.5.0.

I am working on starting the release of 1.6.0 in the next few weeks.

jrgemignani avatar Aug 08 '25 04:08 jrgemignani

@jrgemignani I see, thank you for the information. Is there a set date for the 1.6.0 release?

Additionally, I would like to ask whether, in your experience, others have been successful using apache/age in a use case similar to ours?

Thanks again for your time and advice :)

ysvoon avatar Aug 08 '25 05:08 ysvoon

@ysvoon No set date, unfortunately. I will try to look at what you posted above sometime tomorrow.

jrgemignani avatar Aug 08 '25 05:08 jrgemignani

@jrgemignani Hi, while you are checking the above comments, may I additionally ask whether the 1.6.0 release will be backward compatible? For example, might we need major code changes should we upgrade to the newer 1.6.0 release?

ysvoon avatar Aug 11 '25 04:08 ysvoon

@jrgemignani Hello, I haven't heard back in some time; just wanted to check that everything's okay and whether you've managed to have a look at the above comments? :)

ysvoon avatar Aug 18 '25 05:08 ysvoon

@jrgemignani Hi! Just bumping the issue, any feedback or suggestion from you will be greatly appreciated!

ysvoon avatar Sep 12 '25 07:09 ysvoon

@ysvoon I'm currently working on Apache AGE 1.6.0 for PostgreSQL 16. The voting is currently going on, so the release will likely be out this coming week, provided everything goes well. I've also been working on some other performance issues.

jrgemignani avatar Sep 13 '25 07:09 jrgemignani

@jrgemignani Got it, thanks! May I ask what performance issues you are referring to, and whether they are related to any of the issues mentioned above in this GitHub issue?

ysvoon avatar Sep 17 '25 02:09 ysvoon

@ysvoon The perf. issues are general in nature and deal largely with MATCH. But, as MERGE is basically MATCH plus CREATE-if-not-exists, it might help.

For your query, have you tried btree indexes on each label table for the node's id and the edge's start_id and end_id columns?

jrgemignani avatar Sep 22 '25 22:09 jrgemignani

@jrgemignani That's good to hear, maybe it will improve the performance for our use case.

I have tried btree indexes on the label (vertex) tables, but it seems I am only able to create btree indexes on the id column, not on properties (and I am creating the edges based on one of the id keys within properties). When I ran EXPLAIN ANALYZE, it always showed a Seq Scan being performed, which resulted in much worse performance compared to the GIN index on properties, which we could see was being used.

I also tried creating edges on the graph id instead of our custom key in properties, with the btree index in place, but it was still performing sequential scans.

I did not create indexes on the edge table's start_id and end_id, as I am trying to create the edges in this table itself.

ysvoon avatar Sep 24 '25 03:09 ysvoon

@ysvoon You should only use btree indexes on the id, start_id, and end_id columns. For properties, use a gin index.

For example:

CREATE INDEX some_node_table_name_btree_id ON some_schema.some_node_table_name USING btree (id);
CREATE INDEX some_node_table_name_gin_properties ON some_schema.some_node_table_name USING gin (properties);
CREATE INDEX some_edge_table_name_btree_ids ON some_schema.some_edge_table_name USING btree (start_id, end_id);
CREATE INDEX some_edge_table_name_gin_properties ON some_schema.some_edge_table_name USING gin (properties);

jrgemignani avatar Sep 24 '25 18:09 jrgemignani

@ysvoon I want to note that if EXPLAIN shows that the plan will be using HASH joins, it will likely ignore some, or all, of the indexes.

jrgemignani avatar Sep 24 '25 18:09 jrgemignani

@ysvoon I wanted to let you know that the latest release of Apache AGE 1.6.0 is now available for PG16 and PG17.

Additionally, I've seen some work, which I still need to review, that may significantly speed up the load process. That is all I can say at the moment; I wanted to let you know that I haven't forgotten about your issue :) But I also don't want to get your hopes up too high just yet.

jrgemignani avatar Oct 12 '25 17:10 jrgemignani