Major Performance Difference: SQL vs. Cypher for Aggregation/Ordering
I am executing a performance benchmarking for Apache AGE by using goodreads dataset. I've found a significant performance gap between a direct SQL query and its AGE/Cypher equivalent, specifically with aggregation, grouping, and ordering on a large dataset.
Here is my graph design:
Fast Direct PostgreSQL Query (3 seconds):
select count(*), u.id from "User" u, "HAS_INTERACTION" h, "Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 10;
Slow AGE/Cypher Query (50+ seconds):
SELECT * FROM cypher('goodreads_graph', $$ MATCH (u:User)-[:HAS_INTERACTION]->() RETURN u.user_id, count(*) ORDER BY count(*) DESC LIMIT 10 $$) AS (user_id agtype, interaction_count agtype);
I expect AGE/Cypher to be much closer in performance to direct SQL. Currently, the Cypher query is over 15x slower, which is a major issue.
I believe my sample Cypher query is similar to, and aligns with, the sorting on aggregate functions sample in the official AGE documentation. Is this a bug that will be addressed in an upcoming release?
I'm looking for guidance on how to optimize this Cypher query to achieve performance more comparable to the direct PostgreSQL query. Any recommendations on AGE-specific best practices, indexing for aggregations, or relevant configuration tuning would be greatly appreciated.
For reference, when the ORDER BY clause is removed from the AGE/Cypher query, its execution time significantly improves.
@serdarmicrosoft It shouldn't run that slow, we will need to look into this.
@serdarmicrosoft Could you add in the output of explain for this. Both for the sql and for the cypher -
EXPLAIN select count(*), u.id from "User" u, "HAS_INTERACTION" h, "Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 10;
SELECT * FROM cypher('goodreads_graph', $$ EXPLAIN MATCH (u:User)-[:HAS_INTERACTION]->() RETURN u.user_id, count(*) ORDER BY count(*) DESC LIMIT 10 $$) AS (user_id agtype, interaction_count agtype);
Also, if there are any indexes used, could you do this both before and after the index is added. That would be helpful.
@serdarmicrosoft In your SQL query, the tables are specified, but in Cypher query, the target label is not specified which causes age to scan all the vertex labels.
SQL query plan:
db=# EXPLAIN select count(*), u.id from issue_2194."User" u, issue_2194."HAS_INTERACTION" h, issue_2194."Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
..........................
-> Seq Scan on "User" u (cost=0.00..22.00 rows=1200 width=8)
...........
-> Seq Scan on "HAS_INTERACTION" h (cost=0.00..19.70 rows=970 width=16)
.........
-> Seq Scan on "Book" b (cost=0.00..22.00 rows=1200 width=8)
Cypher query plan:
db=# SELECT * FROM cypher('issue_2194', $$EXPLAIN MATCH (u:User)-[:HAS_INTERACTION]->() RETURN u.user_id, count(*) ORDER BY count(*) DESC LIMIT 10 $$) as (user_id agtype, count agtype);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
............................................
-> Parallel Append (cost=0.00..78.24 rows=2001 width=8)
-> Parallel Seq Scan on "Book" _age_default_alias_1_2 (cost=0.00..17.06 rows=706 width=8)
-> Parallel Seq Scan on "Author" _age_default_alias_1_3 (cost=0.00..17.06 rows=706 width=8)
-> Parallel Seq Scan on "Review" _age_default_alias_1_4 (cost=0.00..17.06 rows=706 width=8)
-> Parallel Seq Scan on "User" _age_default_alias_1_5 (cost=0.00..17.06 rows=706 width=8)
-> Parallel Seq Scan on _ag_label_vertex _age_default_alias_1_1 (cost=0.00..0.00 rows=1 width=8)
.....
-> Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..19.70 rows=970
..........
-> Seq Scan on "User" u (cost=0.00..22.00 rows=1200 width=40)
Potential optimization is to explicitly include the label on the target node:
SELECT * FROM cypher('graph', $$
EXPLAIN
MATCH (u:User)-[:HAS_INTERACTION]->(:Book)
RETURN u.user_id, count(*)
ORDER BY count(*) DESC
LIMIT 10
$$) AS (user_id agtype, count agtype);
That said, I believe AGE should ideally be smart enough to infer and apply such label constraints automatically to reduce unnecessary scans when label filters are present elsewhere in the pattern.
Let us know if that improves some performance.
That said, I believe AGE should ideally be smart enough to infer and apply such label constraints automatically to reduce unnecessary scans when label filters are present elsewhere in the pattern.
I'm not sure that is possible to do. By stating ->() it means any end node and edges can terminate in any node.
Btw, that would also be true for ()-, coming from any node and -[], going through any edge. As each label is a table, you would want to restrict it where possible for better performance.
Thanks @MuhammadTahaNaveed and @jrgemignani We were able to significantly improve the query's performance from 50 seconds down to 10 seconds just by modifying () to (:Book). I'm curious if it's possible to further optimize this, ideally getting its performance closer to the PostgreSQL equivalent, which runs the same query in only 3.6 seconds.
Sorry @jrgemignani I missed your question.
Here is the output of the queries with indexes:
EXPLAIN select count(*), u.id from "User" u, "HAS_INTERACTION" h, "Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 10;
`QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=728449.72..728449.74 rows=10 width=16)
-> Sort (cost=728449.72..730061.43 rows=644686 width=16)
Sort Key: (count(*)) DESC
-> Finalize HashAggregate (cost=695479.90..714518.28 rows=644686 width=16)
Group Key: u.id
Planned Partitions: 16
-> Gather (cost=470126.84..638062.55 rows=1289372 width=16)
Workers Planned: 2
-> Partial HashAggregate (cost=469126.84..508125.35 rows=644686 width=16)
Group Key: u.id
Planned Partitions: 8
-> Parallel Hash Join (cost=15770.68..299858.22 rows=4166612 width=8)
Hash Cond: (h.start_id = u.id)
-> Parallel Hash Join (cost=1875.90..219375.78 rows=4166612 width=8)
Hash Cond: (h.end_id = b.id)
-> Parallel Seq Scan on "HAS_INTERACTION" h (cost=0.00..184514.12 rows=4166612 width=16)
-> Parallel Hash (cost=1389.45..1389.45 rows=38916 width=8)
-> Parallel Index Only Scan using "Book_id_idx" on "Book" b (cost=0.29..1389.45 rows=38916 width=8)
-> Parallel Hash (cost=9487.05..9487.05 rows=268619 width=8)
-> Parallel Index Only Scan using "User_id_idx" on "User" u (cost=0.42..9487.05 rows=268619 width=8)
(20 rows)`
SELECT * FROM cypher('goodreads_graph', $$ EXPLAIN MATCH (u:User)-[:HAS_INTERACTION]->() RETURN u.user_id, count(*) ORDER BY count(*) DESC LIMIT 10 $$) AS (user_id agtype, interaction_count agtype);
`QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=34254481.74..34254481.76 rows=10 width=64)
-> Sort (cost=34254481.74..34256093.45 rows=644686 width=64)
Sort Key: ((count(*))::agtype) DESC
-> GroupAggregate (cost=1109643.97..34240550.31 rows=644686 width=64)
Group Key: agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('1332417'::oid, u.id), u.properties), '"user_id"'::agtype])
-> Nested Loop (cost=1109643.97..21754040.28 rows=2494723262 width=32)
-> Gather Merge (cost=1109643.96..2274293.50 rows=9999869 width=118)
Workers Planned: 2
-> Sort (cost=1108643.94..1119060.47 rows=4166612 width=118)
Sort Key: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('1332417'::oid, u.id), u.properties), '"user_id"'::agtype]))
-> Parallel Hash Join (cost=21655.93..284305.72 rows=4166612 width=118)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184514.12 rows=4166612 width=16)
-> Parallel Hash (cost=13838.19..13838.19 rows=268619 width=110)
-> Parallel Seq Scan on "User" u (cost=0.00..13838.19 rows=268619 width=110)
-> Memoize (cost=0.01..1.54 rows=5 width=8)
Cache Key: _age_default_alias_0.end_id
Cache Mode: logical
-> Append (cost=0.00..1.53 rows=5 width=8)
-> Seq Scan on _ag_label_vertex _age_default_alias_1_1 (cost=0.00..0.00 rows=1 width=8)
Filter: (id = _age_default_alias_0.end_id)
-> Index Only Scan using "User_id_idx" on "User" _age_default_alias_1_2 (cost=0.42..0.44 rows=1 width=8)
Index Cond: (id = _age_default_alias_0.end_id)
-> Index Only Scan using "Book_id_idx" on "Book" _age_default_alias_1_3 (cost=0.29..0.31 rows=1 width=8)
Index Cond: (id = _age_default_alias_0.end_id)
-> Index Only Scan using "Author_id_idx" on "Author" _age_default_alias_1_4 (cost=0.29..0.31 rows=1 width=8)
Index Cond: (id = _age_default_alias_0.end_id)
-> Index Only Scan using "Review_id_idx" on "Review" _age_default_alias_1_5 (cost=0.43..0.45 rows=1 width=8)
Index Cond: (id = _age_default_alias_0.end_id)
(29 rows)`
without indexes: ( I deleted the indexes for User, Book and HAS_INTERACTION)
EXPLAIN select count(*), u.id from "User" u, "HAS_INTERACTION" h, "Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 10;
` QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Limit (cost=741941.63..741941.66 rows=10 width=16)
-> Sort (cost=741941.63..743553.35 rows=644686 width=16)
Sort Key: (count(*)) DESC
-> Finalize HashAggregate (cost=708971.82..728010.20 rows=644686 width=16)
Group Key: u.id
Planned Partitions: 16
-> Gather (cost=483618.75..651554.47 rows=1289372 width=16)
Workers Planned: 2
-> Partial HashAggregate (cost=482618.75..521617.27 rows=644686 width=16)
Group Key: u.id
Planned Partitions: 8
-> Parallel Hash Join (cost=22114.08..313350.14 rows=4166612 width=8)
Hash Cond: (h.start_id = u.id)
-> Parallel Hash Join (cost=3868.15..228516.55 rows=4166612 width=8)
Hash Cond: (h.end_id = b.id)
-> Parallel Seq Scan on "HAS_INTERACTION" h (cost=0.00..184514.12 rows=4166612 width=16)
-> Parallel Hash (cost=3181.40..3181.40 rows=54940 width=8)
-> Parallel Seq Scan on "Book" b (cost=0.00..3181.40 rows=54940 width=8)
-> Parallel Hash (cost=13838.19..13838.19 rows=268619 width=8)
-> Parallel Seq Scan on "User" u (cost=0.00..13838.19 rows=268619 width=8)
(20 rows)`
SELECT * FROM cypher('goodreads_graph', $$ EXPLAIN MATCH (u:User)-[:HAS_INTERACTION]->() RETURN u.user_id, count(*) ORDER BY count(*) DESC LIMIT 10 $$) AS (user_id agtype, interaction_count agtype);
`QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=93904932.55..93904932.57 rows=10 width=64)
-> Sort (cost=93904932.55..93906544.26 rows=644686 width=64)
Sort Key: ((count(*))::agtype) DESC
-> Finalize GroupAggregate (cost=93721223.30..93891001.11 rows=644686 width=64)
Group Key: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('1332417'::oid, u.id), u.properties), '"user_id"'::agtype]))
-> Gather Merge (cost=93721223.30..93871660.53 rows=1289372 width=40)
Workers Planned: 2
-> Sort (cost=93720223.27..93721834.99 rows=644686 width=40)
Sort Key: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('1332417'::oid, u.id), u.properties), '"user_id"'::agtype]))
-> Partial HashAggregate (cost=79422665.33..93645424.26 rows=644686 width=40)
Group Key: agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('1332417'::oid, u.id), u.properties), '"user_id"'::agtype])
Planned Partitions: 16
-> Merge Join (cost=1596276.56..25013010.85 rows=1039468026 width=32)
Merge Cond: (_age_default_alias_0.end_id = _age_default_alias_1.id)
-> Sort (cost=1108643.94..1119060.47 rows=4166612 width=118)
Sort Key: _age_default_alias_0.end_id
-> Parallel Hash Join (cost=21655.93..284305.72 rows=4166612 width=118)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184514.12 rows=4166612 width=16)
-> Parallel Hash (cost=13838.19..13838.19 rows=268619 width=110)
-> Parallel Seq Scan on "User" u (cost=0.00..13838.19 rows=268619 width=110)
-> Materialize (cost=487632.62..503374.53 rows=3148382 width=8)
-> Sort (cost=487632.62..495503.58 rows=3148382 width=8)
Sort Key: _age_default_alias_1.id
-> Append (cost=0.00..86329.93 rows=3148382 width=8)
-> Seq Scan on _ag_label_vertex _age_default_alias_1_1 (cost=0.00..0.00 rows=1 width=8)
-> Seq Scan on "User" _age_default_alias_1_2 (cost=0.00..17598.86 rows=644686 width=8)
-> Seq Scan on "Book" _age_default_alias_1_3 (cost=0.00..3565.98 rows=93398 width=8)
-> Index Only Scan using "Author_id_idx" on "Author" _age_default_alias_1_4 (cost=0.29..503.19 rows=23393 width=8)
-> Index Only Scan using "Review_id_idx" on "Review" _age_default_alias_1_5 (cost=0.43..48920.00 rows=2386904 width=8)
(30 rows)`
Here is the indexes I used:
-- Index on 'id' for all vertex labels
CREATE INDEX IF NOT EXISTS idx_user_id ON goodreads_graph."User" USING BTREE (id);
CREATE INDEX IF NOT EXISTS idx_review_id ON goodreads_graph."Review" USING BTREE (id);
CREATE INDEX IF NOT EXISTS idx_book_id ON goodreads_graph."Book" USING BTREE (id);
CREATE INDEX IF NOT EXISTS idx_author_id ON goodreads_graph."Author" USING BTREE (id);
CREATE INDEX ON goodreads_graph."User" USING GIN (properties);
CREATE INDEX ON goodreads_graph."Review" USING GIN (properties);
CREATE INDEX ON goodreads_graph."Book" USING GIN (properties);
CREATE INDEX ON goodreads_graph."Author" USING GIN (properties);
-- Index for WHERE clause: u.user_id = '...'
CREATE INDEX IF NOT EXISTS idx_user_userid_op ON goodreads_graph."User" USING BTREE (agtype_access_operator(VARIADIC ARRAY[properties, '"user_id"'::agtype]));
-- Index for WHERE clause: b.book_id = '...'
CREATE INDEX IF NOT EXISTS idx_book_bookid_op ON goodreads_graph."Book" USING BTREE (agtype_access_operator(VARIADIC ARRAY[properties, '"book_id"'::agtype]));
-- WRITTEN_BY edge: Review -[:WRITTEN_BY]-> User
CREATE INDEX IF NOT EXISTS idx_written_by_start ON goodreads_graph."WRITTEN_BY" USING BTREE (start_id);
CREATE INDEX IF NOT EXISTS idx_written_by_end ON goodreads_graph."WRITTEN_BY" USING BTREE (end_id);
-- HAS_REVIEW edge: Book -[:HAS_REVIEW]-> Review
CREATE INDEX IF NOT EXISTS idx_has_review_start ON goodreads_graph."HAS_REVIEW" USING BTREE (start_id);
CREATE INDEX IF NOT EXISTS idx_has_review_end ON goodreads_graph."HAS_REVIEW" USING BTREE (end_id);
-- Index for WHERE clause: r.review_id = '...'
CREATE INDEX IF NOT EXISTS idx_review_reviewid_op ON goodreads_graph."Review" USING BTREE (agtype_access_operator(VARIADIC ARRAY[properties, '"review_id"'::agtype]));
-- AUTHORED_BY edge: Book -[:AUTHORED_BY]-> Author
CREATE INDEX IF NOT EXISTS idx_authored_by_start ON goodreads_graph."AUTHORED_BY" USING BTREE (start_id);
CREATE INDEX IF NOT EXISTS idx_authored_by_end ON goodreads_graph."AUTHORED_BY" USING BTREE (end_id);
-- HAS_INTERACTION edge: User -[:HAS_INTERACTION]-> Book
CREATE INDEX IF NOT EXISTS idx_has_interaction_start ON goodreads_graph."HAS_INTERACTION" USING BTREE (start_id);
CREATE INDEX IF NOT EXISTS idx_has_interaction_end ON goodreads_graph."HAS_INTERACTION" USING BTREE (end_id);
I'm curious if it's possible to further optimize this, ideally getting its performance closer to the PostgreSQL equivalent, which runs the same query in only 3.6 seconds.
@serdarmicrosoft Atm, I can't think of any further optimizations given the dataset and structure. Once I get the dataset loaded and poke around, I might have some ideas.
@serdarmicrosoft @MuhammadTahaNaveed From my local system with the above graph loaded (PG17, CentOS 10)-
Graph statistics
psql-17.5-5432-pgsql=# select * from cypher('goodreads_graph_x', $$ return graph_stats('goodreads_graph_x') $$) as (results agtype);
results
----------------------------------------------------------------------------------------------
{"graph": "goodreads_graph_x", "num_loaded_edges": 14897522, "num_loaded_vertices": 3151377}
(1 row)
psql-17.5-5432-pgsql=#
Open end vertex
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$ explain analyze MATCH (u:User)-[:HAS_INTERACTION]->() RETURN u.user_id, count(*) ORDER BY count(*) DESC LIMIT 10 $$) AS (user_id agtype, interaction_count agtype);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------
Limit (cost=34001551.37..34001551.40 rows=10 width=64) (actual time=108330.804..108352.939 rows=10 loops=1)
-> Sort (cost=34001551.37..34003163.09 rows=644686 width=64) (actual time=108330.802..108352.936 rows=10 loops=1)
Sort Key: ((count(*))::agtype) DESC
Sort Method: top-N heapsort Memory: 27kB
-> GroupAggregate (cost=1256126.22..33987619.94 rows=644686 width=64) (actual time=33750.560..108224.304 rows=132661 loops=1)
Group Key: agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties), '"user_id"'::agtype]
)
-> Nested Loop (cost=1256126.22..21661006.56 rows=2462743931 width=32) (actual time=33747.672..103070.495 rows=10000000 loops=1)
-> Gather Merge (cost=1256126.21..2420813.48 rows=10000193 width=118) (actual time=33747.455..54797.290 rows=10000000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=1255126.18..1265543.05 rows=4166747 width=118) (actual time=33671.073..45033.045 rows=3333333 loops=3)
Sort Key: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties), '"
user_id"'::agtype]))
Sort Method: external merge Disk: 549696kB
Worker 0: Sort Method: external merge Disk: 566440kB
Worker 1: Sort Method: external merge Disk: 557168kB
-> Parallel Hash Join (cost=21619.93..284274.14 rows=4166747 width=118) (actual time=1721.984..15407.078 rows=3333333 loops=3
)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184515.47 rows=4166747 width=16) (actual tim
e=0.038..838.437 rows=3333333 loops=3)
-> Parallel Hash (cost=13802.19..13802.19 rows=268619 width=110) (actual time=112.966..112.967 rows=214895 loops=3)
Buckets: 65536 (originally 65536) Batches: 32 (originally 16) Memory Usage: 9792kB
-> Parallel Seq Scan on "User" u (cost=0.00..13802.19 rows=268619 width=110) (actual time=0.048..41.160 rows=2148
95 loops=3)
-> Memoize (cost=0.01..1.54 rows=5 width=8) (actual time=0.001..0.001 rows=1 loops=10000000)
Cache Key: _age_default_alias_0.end_id
Cache Mode: logical
Hits: 9907870 Misses: 92130 Evictions: 17232 Overflows: 0 Memory Usage: 8193kB
-> Append (cost=0.00..1.53 rows=5 width=8) (actual time=0.013..0.023 rows=1 loops=92130)
-> Seq Scan on _ag_label_vertex _age_default_alias_1_1 (cost=0.00..0.00 rows=1 width=8) (actual time=0.000..0.000 rows=0 loop
s=92130)
Filter: (id = _age_default_alias_0.end_id)
-> Index Only Scan using "User_id_idx" on "User" _age_default_alias_1_2 (cost=0.42..0.44 rows=1 width=8) (actual time=0.006..
0.006 rows=0 loops=92130)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
-> Index Only Scan using "Book_id_idx" on "Book" _age_default_alias_1_3 (cost=0.29..0.31 rows=1 width=8) (actual time=0.005..
0.005 rows=1 loops=92130)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
-> Index Only Scan using "Author_id_idx" on "Author" _age_default_alias_1_4 (cost=0.29..0.31 rows=1 width=8) (actual time=0.0
04..0.004 rows=0 loops=92130)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
-> Index Only Scan using "Review_id_idx" on "Review" _age_default_alias_1_5 (cost=0.43..0.45 rows=1 width=8) (actual time=0.0
05..0.005 rows=0 loops=92130)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
Planning Time: 1.962 ms
Execution Time: 108387.898 ms
(42 rows)
psql-17.5-5432-pgsql=#
Set end vertex
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$ explain analyze MATCH (u:User)-[:HAS_INTERACTION]->(:Book) RETURN u.user_id, count(*) ORDER BY count(*) DESC LIMIT 10 $$) AS (user_id agtype, interaction_count agtype);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------
Limit (cost=1007580.65..1007580.67 rows=10 width=64) (actual time=21179.688..21199.448 rows=10 loops=1)
-> Sort (cost=1007580.65..1009192.36 rows=644686 width=64) (actual time=21179.686..21199.445 rows=10 loops=1)
Sort Key: ((count(*))::agtype) DESC
Sort Method: top-N heapsort Memory: 27kB
-> Finalize GroupAggregate (cost=823871.40..993649.21 rows=644686 width=64) (actual time=20646.968..21112.283 rows=132661 loops=1)
Group Key: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties), '"user_id"'::agtype
]))
-> Gather Merge (cost=823871.40..974308.63 rows=1289372 width=40) (actual time=20646.945..20848.792 rows=132820 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=822871.37..824483.09 rows=644686 width=40) (actual time=20607.823..20674.302 rows=44273 loops=3)
Sort Key: (agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties), '"user_i
d"'::agtype]))
Sort Method: quicksort Memory: 4030kB
Worker 0: Sort Method: external merge Disk: 2896kB
Worker 1: Sort Method: external merge Disk: 2928kB
-> Partial HashAggregate (cost=674786.11..743035.36 rows=644686 width=40) (actual time=20149.603..20164.340 rows=44273 loops=3)
Group Key: agtype_access_operator(VARIADIC ARRAY[_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties), '"
user_id"'::agtype])
Planned Partitions: 16 Batches: 1 Memory Usage: 6417kB
Worker 0: Batches: 1 Memory Usage: 6673kB
Worker 1: Batches: 1 Memory Usage: 6673kB
-> Parallel Hash Join (cost=23994.82..342748.46 rows=4166747 width=32) (actual time=2795.794..16327.732 rows=3333333 loops=3)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Hash Join (cost=2374.89..219877.18 rows=4166747 width=8) (actual time=18.239..1938.707 rows=3333333 loops=3
)
Hash Cond: (_age_default_alias_0.end_id = _age_default_alias_1.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184515.47 rows=4166747 width=16) (actu
al time=0.025..524.881 rows=3333333 loops=3)
-> Parallel Hash (cost=1888.44..1888.44 rows=38916 width=8) (actual time=17.815..17.817 rows=31133 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 4768kB
-> Parallel Index Only Scan using "Book_id_idx" on "Book" _age_default_alias_1 (cost=0.29..1888.44 rows=389
16 width=8) (actual time=0.089..7.761 rows=31133 loops=3)
Heap Fetches: 0
-> Parallel Hash (cost=13802.19..13802.19 rows=268619 width=110) (actual time=123.008..123.009 rows=214895 loops=3)
Buckets: 65536 (originally 65536) Batches: 32 (originally 16) Memory Usage: 9792kB
-> Parallel Seq Scan on "User" u (cost=0.00..13802.19 rows=268619 width=110) (actual time=0.045..44.468 rows=2148
95 loops=3)
Planning Time: 6.800 ms
Execution Time: 21202.833 ms
(33 rows)
psql-17.5-5432-pgsql=#
SQL command
psql-17.5-5432-pgsql=# explain analyze select count(*), u.id from goodreads_graph_x."User" u, goodreads_graph_x."HAS_INTERACTION" h, goodreads_graph_x."Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------
Limit (cost=828731.73..828731.75 rows=10 width=16) (actual time=5160.862..5183.487 rows=10 loops=1)
-> Sort (cost=828731.73..830343.44 rows=644686 width=16) (actual time=5160.860..5183.484 rows=10 loops=1)
Sort Key: (count(*)) DESC
Sort Method: top-N heapsort Memory: 25kB
-> Finalize GroupAggregate (cost=651469.34..814800.29 rows=644686 width=16) (actual time=5030.692..5154.455 rows=132661 loops=1)
Group Key: u.id
-> Gather Merge (cost=651469.34..801906.57 rows=1289372 width=16) (actual time=5030.671..5092.293 rows=135523 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=650469.31..652081.03 rows=644686 width=16) (actual time=5023.358..5031.229 rows=45174 loops=3)
Sort Key: u.id
Sort Method: quicksort Memory: 2980kB
Worker 0: Sort Method: quicksort Memory: 2901kB
Worker 1: Sort Method: quicksort Memory: 2963kB
-> Partial HashAggregate (cost=538245.23..577244.80 rows=644686 width=16) (actual time=4987.979..5002.350 rows=45174 loops=3)
Group Key: u.id
Planned Partitions: 8 Batches: 1 Memory Usage: 6929kB
Worker 0: Batches: 1 Memory Usage: 6673kB
Worker 1: Batches: 1 Memory Usage: 6673kB
-> Parallel Hash Join (cost=19772.67..303865.71 rows=4166747 width=8) (actual time=2829.474..3983.936 rows=3333333 loops=3)
Hash Cond: (h.start_id = u.id)
-> Parallel Hash Join (cost=2374.89..219877.18 rows=4166747 width=8) (actual time=17.717..1958.554 rows=3333333 loops=3
)
Hash Cond: (h.end_id = b.id)
-> Parallel Seq Scan on "HAS_INTERACTION" h (cost=0.00..184515.47 rows=4166747 width=16) (actual time=0.057..527.
239 rows=3333333 loops=3)
-> Parallel Hash (cost=1888.44..1888.44 rows=38916 width=8) (actual time=16.894..16.895 rows=31133 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 4736kB
-> Parallel Index Only Scan using "Book_id_idx" on "Book" b (cost=0.29..1888.44 rows=38916 width=8) (actual
time=0.054..7.187 rows=31133 loops=3)
Heap Fetches: 0
-> Parallel Hash (cost=12990.05..12990.05 rows=268619 width=8) (actual time=117.643..117.644 rows=214895 loops=3)
Buckets: 262144 (originally 262144) Batches: 16 (originally 8) Memory Usage: 12352kB
-> Parallel Index Only Scan using "User_id_idx" on "User" u (cost=0.42..12990.05 rows=268619 width=8) (actual tim
e=0.080..52.500 rows=214895 loops=3)
Heap Fetches: 0
Planning Time: 0.899 ms
Execution Time: 5186.043 ms
(34 rows)
psql-17.5-5432-pgsql=#
@serdarmicrosoft I should note that the SQL command above references Book, so it is more comparable to the MATCH that does the same.
@serdarmicrosoft I need to point out that these 2 commands are still not quite the same -
psql-17.5-5432-pgsql=# select count(*), u.id from goodreads_graph_x."User" u, goodreads_graph_x."HAS_INTERACTION" h, goodreads_graph_x."Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY 1 DESC LIMIT 10;
count | id
-------+-----------------
8889 | 844424930606649
8328 | 844424930455288
6882 | 844424930701142
6793 | 844424930595806
6165 | 844424930340731
5915 | 844424930542262
5899 | 844424930582002
5858 | 844424930295719
5699 | 844424930678986
5649 | 844424930222488
(10 rows)
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$ MATCH (u:User)-[:HAS_INTERACTION]->(:Book) RETURN u.user_id, count(*) ORDER BY count(*) DESC LIMIT 10 $$) AS (user_id agtype, interaction_count agtype);
user_id | interaction_count
------------------------------------+-------------------
"bc7862cf7449815372fc58c8a817b488" | 8889
"806c0fdc5bed9757ed4d3ca8e3a13be5" | 8328
"e1f84cc029a12d6b57ad10432a0d3649" | 6882
"b83502f98865ff9ed70755404c353b56" | 6793
"52f5430e583a5e0043b3e6c83953b68a" | 6165
"a2ed685cf6398e72c5ec21a556489761" | 5915
"b2b770716941c4aab7227ac62230cc7f" | 5899
"410ffd0e48b0899ec40b54e571c4607d" | 5858
"d93961c6b197f33c3747ceba6fafbbdc" | 5699
"2406ed0afc9f589ec7bf0766242371e0" | 5649
(10 rows)
u.id and user_id are different. u.id is the id of the vertex, user_id is the id inside the vertex properties.
This would be the comparable command -
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$ MATCH (u:User)-[:HAS_INTERACTION]->(:Book) RETURN count(*), id(u) ORDER BY count(*) DESC LIMIT 10 $$) AS (count agtype, id agtype);
count | id
-------+-----------------
8889 | 844424930606649
8328 | 844424930455288
6882 | 844424930701142
6793 | 844424930595806
6165 | 844424930340731
5915 | 844424930542262
5899 | 844424930582002
5858 | 844424930295719
5699 | 844424930678986
5649 | 844424930222488
(10 rows)
Here is the explain analyze for this command -
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$ explain analyze MATCH (u:User)-[:HAS_INTERACTION]->(:Book) RETURN count(*), id(u) ORDER BY count(*) DESC LIMIT 10 $$) AS (count agtype, id agtype);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------
Limit (cost=1007580.65..1007580.67 rows=10 width=64) (actual time=18006.682..18027.741 rows=10 loops=1)
-> Sort (cost=1007580.65..1009192.36 rows=644686 width=64) (actual time=18006.679..18027.738 rows=10 loops=1)
Sort Key: ((count(*))::agtype) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Finalize GroupAggregate (cost=823871.40..993649.21 rows=644686 width=64) (actual time=17584.455..17949.704 rows=132661 loops=1)
Group Key: (age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties)))
-> Gather Merge (cost=823871.40..974308.63 rows=1289372 width=40) (actual time=17584.430..17745.016 rows=132983 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=822871.37..824483.09 rows=644686 width=40) (actual time=17565.748..17572.804 rows=44328 loops=3)
Sort Key: (age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties)))
Sort Method: quicksort Memory: 3447kB
Worker 0: Sort Method: quicksort Memory: 3322kB
Worker 1: Sort Method: quicksort Memory: 3562kB
-> Partial HashAggregate (cost=674786.11..743035.36 rows=644686 width=40) (actual time=17277.242..17291.670 rows=44328 loops=3)
Group Key: age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties))
Planned Partitions: 16 Batches: 1 Memory Usage: 6417kB
Worker 0: Batches: 1 Memory Usage: 6161kB
Worker 1: Batches: 1 Memory Usage: 6929kB
-> Parallel Hash Join (cost=23994.82..342748.46 rows=4166747 width=32) (actual time=2706.845..13932.719 rows=3333333 loops=3)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Hash Join (cost=2374.89..219877.18 rows=4166747 width=8) (actual time=17.393..1871.862 rows=3333333 loops=3
)
Hash Cond: (_age_default_alias_0.end_id = _age_default_alias_1.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184515.47 rows=4166747 width=16) (actu
al time=0.032..503.387 rows=3333333 loops=3)
-> Parallel Hash (cost=1888.44..1888.44 rows=38916 width=8) (actual time=16.974..16.975 rows=31133 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 4768kB
-> Parallel Index Only Scan using idx_book_id on "Book" _age_default_alias_1 (cost=0.29..1888.44 rows=38916
width=8) (actual time=0.067..7.070 rows=31133 loops=3)
Heap Fetches: 0
-> Parallel Hash (cost=13802.19..13802.19 rows=268619 width=110) (actual time=112.913..112.914 rows=214895 loops=3)
Buckets: 65536 (originally 65536) Batches: 32 (originally 16) Memory Usage: 9792kB
-> Parallel Seq Scan on "User" u (cost=0.00..13802.19 rows=268619 width=110) (actual time=0.047..40.515 rows=2148
95 loops=3)
Planning Time: 1.131 ms
Execution Time: 18030.600 ms
(33 rows)
@serdarmicrosoft Looking at the above I see something that could be contributing to the extra time needed to execute the cypher command -
...
Group Key: (age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties)))
...
Sort Key: (age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties)))
...
I need to see if there is an alternative way to do this.
@serdarmicrosoft I am working on a potential fix to this issue. My tests, so far, show nearly identical execution times between the two queries.
@serdarmicrosoft PR #2199 adds additional variables that can have a huge impact on the performance of this query. Below I have added the output of explain analyze on my server for the above query -
SQL baseline
psql-17.5-5432-pgsql=# explain analyze select count(*), u.id from goodreads_graph_x."User" u, goodreads_graph_x."HAS_INTERACTION" h, goodreads_graph_x."Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY count(*) DESC LIMIT 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------
Limit (cost=828731.73..828731.75 rows=10 width=16) (actual time=4988.478..5016.965 rows=10 loops=1)
-> Sort (cost=828731.73..830343.44 rows=644686 width=16) (actual time=4988.476..5016.962 rows=10 loops=1)
Sort Key: (count(*)) DESC
Sort Method: top-N heapsort Memory: 25kB
-> Finalize GroupAggregate (cost=651469.34..814800.29 rows=644686 width=16) (actual time=4860.272..4986.960 rows=132661 loops=1)
Group Key: u.id
-> Gather Merge (cost=651469.34..801906.57 rows=1289372 width=16) (actual time=4860.263..4925.715 rows=135627 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=650469.31..652081.03 rows=644686 width=16) (actual time=4852.214..4858.651 rows=45209 loops=3)
Sort Key: u.id
Sort Method: quicksort Memory: 2966kB
Worker 0: Sort Method: quicksort Memory: 2954kB
Worker 1: Sort Method: quicksort Memory: 2927kB
-> Partial HashAggregate (cost=538245.23..577244.80 rows=644686 width=16) (actual time=4816.899..4831.099 rows=45209 loops=3)
Group Key: u.id
Planned Partitions: 8 Batches: 1 Memory Usage: 6673kB
Worker 0: Batches: 1 Memory Usage: 6673kB
Worker 1: Batches: 1 Memory Usage: 6673kB
-> Parallel Hash Join (cost=19772.67..303865.71 rows=4166747 width=8) (actual time=2707.748..3833.753 rows=3333333 loops=3)
Hash Cond: (h.start_id = u.id)
-> Parallel Hash Join (cost=2374.89..219877.18 rows=4166747 width=8) (actual time=17.647..1876.387 rows=3333333 loops=3
)
Hash Cond: (h.end_id = b.id)
-> Parallel Seq Scan on "HAS_INTERACTION" h (cost=0.00..184515.47 rows=4166747 width=16) (actual time=0.039..507.
552 rows=3333333 loops=3)
-> Parallel Hash (cost=1888.44..1888.44 rows=38916 width=8) (actual time=16.853..16.854 rows=31133 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 4768kB
-> Parallel Index Only Scan using idx_book_id on "Book" b (cost=0.29..1888.44 rows=38916 width=8) (actual t
ime=0.049..7.166 rows=31133 loops=3)
Heap Fetches: 0
-> Parallel Hash (cost=12990.05..12990.05 rows=268619 width=8) (actual time=113.746..113.747 rows=214895 loops=3)
Buckets: 262144 (originally 262144) Batches: 16 (originally 8) Memory Usage: 12352kB
-> Parallel Index Only Scan using "User_id_idx" on "User" u (cost=0.42..12990.05 rows=268619 width=8) (actual tim
e=0.046..50.956 rows=214895 loops=3)
Heap Fetches: 0
Planning Time: 0.929 ms
Execution Time: 5019.400 ms
(34 rows)
psql-17.5-5432-pgsql=#
Original Cypher query without Book endpoint
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$explain analyze MATCH (u:User)-[:HAS_INTERACTION]->() RETURN count(*), id(u) ORDER BY count(*) DESC LIMIT 10 $$) AS (count agtype, id agtype);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------
Limit (cost=34041184.37..34041184.40 rows=10 width=64) (actual time=82382.918..82407.803 rows=10 loops=1)
-> Sort (cost=34041184.37..34042796.09 rows=644686 width=64) (actual time=82382.916..82407.800 rows=10 loops=1)
Sort Key: ((count(*))::agtype) DESC
Sort Method: top-N heapsort Memory: 26kB
-> GroupAggregate (cost=1256126.22..34027252.94 rows=644686 width=64) (actual time=25704.595..82308.264 rows=132661 loops=1)
Group Key: age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties))
-> Nested Loop (cost=1256126.22..21684786.36 rows=2465914571 width=32) (actual time=25702.797..79040.874 rows=10000000 loops=1)
-> Gather Merge (cost=1256126.21..2420813.48 rows=10000193 width=118) (actual time=25702.610..40552.783 rows=10000000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=1255126.18..1265543.05 rows=4166747 width=118) (actual time=25658.964..33652.909 rows=3333333 loops=3)
Sort Key: (age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties)))
Sort Method: external merge Disk: 483984kB
Worker 0: Sort Method: external merge Disk: 491056kB
Worker 1: Sort Method: external merge Disk: 502712kB
-> Parallel Hash Join (cost=21619.93..284274.14 rows=4166747 width=118) (actual time=1935.223..13026.837 rows=3333333 loops=3
)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184515.47 rows=4166747 width=16) (actual tim
e=0.050..845.514 rows=3333333 loops=3)
-> Parallel Hash (cost=13802.19..13802.19 rows=268619 width=110) (actual time=318.674..318.675 rows=214895 loops=3)
Buckets: 65536 (originally 65536) Batches: 32 (originally 16) Memory Usage: 9824kB
-> Parallel Seq Scan on "User" u (cost=0.00..13802.19 rows=268619 width=110) (actual time=0.160..199.981 rows=214
895 loops=3)
-> Memoize (cost=0.01..1.54 rows=5 width=8) (actual time=0.001..0.001 rows=1 loops=10000000)
Cache Key: _age_default_alias_0.end_id
Cache Mode: logical
Hits: 9907873 Misses: 92127 Evictions: 17229 Overflows: 0 Memory Usage: 8193kB
-> Append (cost=0.00..1.53 rows=5 width=8) (actual time=0.011..0.020 rows=1 loops=92127)
-> Seq Scan on _ag_label_vertex _age_default_alias_1_1 (cost=0.00..0.00 rows=1 width=8) (actual time=0.000..0.000 rows=0 loop
s=92127)
Filter: (id = _age_default_alias_0.end_id)
-> Index Only Scan using "User_id_idx" on "User" _age_default_alias_1_2 (cost=0.42..0.44 rows=1 width=8) (actual time=0.005..
0.005 rows=0 loops=92127)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
-> Index Only Scan using idx_book_id on "Book" _age_default_alias_1_3 (cost=0.29..0.31 rows=1 width=8) (actual time=0.004..0.
005 rows=1 loops=92127)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
-> Index Only Scan using idx_author_id on "Author" _age_default_alias_1_4 (cost=0.29..0.31 rows=1 width=8) (actual time=0.003
..0.003 rows=0 loops=92127)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
-> Index Only Scan using idx_review_id on "Review" _age_default_alias_1_5 (cost=0.43..0.45 rows=1 width=8) (actual time=0.005
..0.005 rows=0 loops=92127)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
Planning Time: 1.476 ms
Execution Time: 82440.110 ms
(42 rows)
psql-17.5-5432-pgsql=#
Original Cypher query with Book endpoint
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$explain analyze MATCH (u:User)-[:HAS_INTERACTION]->(:Book) RETURN count(*), id(u) ORDER BY count(*) DESC LIMIT 10 $$) AS (count agtype, id agtype);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------
Limit (cost=1007580.65..1007580.67 rows=10 width=64) (actual time=18222.030..18248.143 rows=10 loops=1)
-> Sort (cost=1007580.65..1009192.36 rows=644686 width=64) (actual time=18222.029..18248.140 rows=10 loops=1)
Sort Key: ((count(*))::agtype) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Finalize GroupAggregate (cost=823871.40..993649.21 rows=644686 width=64) (actual time=17789.320..18166.135 rows=132661 loops=1)
Group Key: (age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties)))
-> Gather Merge (cost=823871.40..974308.63 rows=1289372 width=40) (actual time=17789.302..17949.998 rows=132923 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=822871.37..824483.09 rows=644686 width=40) (actual time=17762.365..17772.000 rows=44308 loops=3)
Sort Key: (age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties)))
Sort Method: quicksort Memory: 3503kB
Worker 0: Sort Method: quicksort Memory: 3432kB
Worker 1: Sort Method: quicksort Memory: 3393kB
-> Partial HashAggregate (cost=674786.11..743035.36 rows=644686 width=40) (actual time=17462.389..17476.932 rows=44308 loops=3)
Group Key: age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties))
Planned Partitions: 16 Batches: 1 Memory Usage: 6673kB
Worker 0: Batches: 1 Memory Usage: 6417kB
Worker 1: Batches: 1 Memory Usage: 6417kB
-> Parallel Hash Join (cost=23994.82..342748.46 rows=4166747 width=32) (actual time=2755.273..14367.058 rows=3333333 loops=3)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Hash Join (cost=2374.89..219877.18 rows=4166747 width=8) (actual time=17.622..1915.252 rows=3333333 loops=3
)
Hash Cond: (_age_default_alias_0.end_id = _age_default_alias_1.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184515.47 rows=4166747 width=16) (actu
al time=0.024..519.098 rows=3333333 loops=3)
-> Parallel Hash (cost=1888.44..1888.44 rows=38916 width=8) (actual time=17.200..17.201 rows=31133 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 4736kB
-> Parallel Index Only Scan using idx_book_id on "Book" _age_default_alias_1 (cost=0.29..1888.44 rows=38916
width=8) (actual time=0.070..7.131 rows=31133 loops=3)
Heap Fetches: 0
-> Parallel Hash (cost=13802.19..13802.19 rows=268619 width=110) (actual time=118.590..118.590 rows=214895 loops=3)
Buckets: 65536 (originally 65536) Batches: 32 (originally 16) Memory Usage: 9792kB
-> Parallel Seq Scan on "User" u (cost=0.00..13802.19 rows=268619 width=110) (actual time=0.046..44.205 rows=2148
95 loops=3)
Planning Time: 2.422 ms
Execution Time: 18251.645 ms
(33 rows)
psql-17.5-5432-pgsql=#
Cypher query without Book endpoint using the new vars
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$explain analyze MATCH (u:User)-[:HAS_INTERACTION]->() RETURN count(*), u_idc ORDER BY count(*) DESC LIMIT 10 $$) AS (count agtype, u_idc agtype);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------
Subquery Scan on _ (cost=6539203.76..6539203.81 rows=10 width=64) (actual time=10232.518..10232.605 rows=10 loops=1)
-> Limit (cost=6539203.76..6539203.79 rows=10 width=40) (actual time=10232.514..10232.595 rows=10 loops=1)
-> Sort (cost=6539203.76..6540815.48 rows=644686 width=40) (actual time=10232.512..10232.592 rows=10 loops=1)
Sort Key: ((count(*))::agtype) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Finalize GroupAggregate (cost=786218.01..6525272.33 rows=644686 width=40) (actual time=2704.329..10140.579 rows=132661 loops=1)
Group Key: u.id
-> Gather Merge (cost=786218.01..6510766.89 rows=1289372 width=16) (actual time=2704.314..10002.115 rows=133999 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=785217.99..6360941.35 rows=644686 width=16) (actual time=2400.315..9630.184 rows=44666 loops=3)
Group Key: u.id
-> Nested Loop (cost=785217.99..1217172.46 rows=1027464405 width=8) (actual time=2399.584..9121.389 rows=3333333 loops=3)
-> Merge Join (cost=785217.98..885067.31 rows=4166747 width=16) (actual time=2399.385..5151.825 rows=3333333 loops=3)
Merge Cond: (_age_default_alias_0.start_id = u.id)
-> Sort (cost=785081.51..795498.38 rows=4166747 width=16) (actual time=2399.313..3120.343 rows=3333333 loops=3)
Sort Key: _age_default_alias_0.start_id
Sort Method: external merge Disk: 87984kB
Worker 0: Sort Method: external merge Disk: 83192kB
Worker 1: Sort Method: external merge Disk: 83280kB
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184515.47 rows=4166747 width=16)
(actual time=0.045..842.639 rows=3333333 loops=3)
-> Materialize (cost=0.42..18362.43 rows=644686 width=8) (actual time=0.059..560.710 rows=3933332 loops=3)
-> Index Only Scan using "User_id_idx" on "User" u (cost=0.42..16750.72 rows=644686 width=8) (actual time=0
.056..142.077 rows=644665 loops=3)
Heap Fetches: 0
-> Memoize (cost=0.01..1.54 rows=5 width=8) (actual time=0.001..0.001 rows=1 loops=10000000)
Cache Key: _age_default_alias_0.end_id
Cache Mode: logical
Hits: 3386794 Misses: 70936 Evictions: 0 Overflows: 0 Memory Usage: 7759kB
Worker 0: Hits: 3200070 Misses: 69560 Evictions: 0 Overflows: 0 Memory Usage: 7609kB
Worker 1: Hits: 3201422 Misses: 71218 Evictions: 0 Overflows: 0 Memory Usage: 7790kB
-> Append (cost=0.00..1.53 rows=5 width=8) (actual time=0.010..0.019 rows=1 loops=211714)
-> Seq Scan on _ag_label_vertex _age_default_alias_1_1 (cost=0.00..0.00 rows=1 width=8) (actual time=0.000.
.0.000 rows=0 loops=211714)
Filter: (id = _age_default_alias_0.end_id)
-> Index Only Scan using "User_id_idx" on "User" _age_default_alias_1_2 (cost=0.42..0.44 rows=1 width=8) (a
ctual time=0.004..0.004 rows=0 loops=211714)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
-> Index Only Scan using idx_book_id on "Book" _age_default_alias_1_3 (cost=0.29..0.31 rows=1 width=8) (act
ual time=0.004..0.004 rows=1 loops=211714)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
-> Index Only Scan using idx_author_id on "Author" _age_default_alias_1_4 (cost=0.29..0.31 rows=1 width=8)
(actual time=0.003..0.003 rows=0 loops=211714)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
-> Index Only Scan using idx_review_id on "Review" _age_default_alias_1_5 (cost=0.43..0.45 rows=1 width=8)
(actual time=0.005..0.005 rows=0 loops=211714)
Index Cond: (id = _age_default_alias_0.end_id)
Heap Fetches: 0
Planning Time: 1.860 ms
Execution Time: 10250.143 ms
(47 rows)
psql-17.5-5432-pgsql=#
Cypher query with Book endpoint using the new vars
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$explain analyze MATCH (u:User)-[:HAS_INTERACTION]->(:Book) RETURN count(*), u_idc ORDER BY count(*) DESC LIMIT 10 $$) AS (count agtype, u_idc agtype);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------
Subquery Scan on _ (cost=830343.44..830343.49 rows=10 width=64) (actual time=5091.056..5118.571 rows=10 loops=1)
-> Limit (cost=830343.44..830343.47 rows=10 width=40) (actual time=5091.052..5118.561 rows=10 loops=1)
-> Sort (cost=830343.44..831955.16 rows=644686 width=40) (actual time=5091.050..5118.558 rows=10 loops=1)
Sort Key: ((count(*))::agtype) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Finalize GroupAggregate (cost=651469.34..816412.01 rows=644686 width=40) (actual time=4851.058..5033.469 rows=132661 loops=1)
Group Key: u.id
-> Gather Merge (cost=651469.34..801906.57 rows=1289372 width=16) (actual time=4851.043..4917.321 rows=135554 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=650469.31..652081.03 rows=644686 width=16) (actual time=4840.942..4849.306 rows=45185 loops=3)
Sort Key: u.id
Sort Method: quicksort Memory: 2962kB
Worker 0: Sort Method: quicksort Memory: 2914kB
Worker 1: Sort Method: quicksort Memory: 2969kB
-> Partial HashAggregate (cost=538245.23..577244.80 rows=644686 width=16) (actual time=4803.703..4818.569 rows=45185 loops=3)
Group Key: u.id
Planned Partitions: 8 Batches: 1 Memory Usage: 6673kB
Worker 0: Batches: 1 Memory Usage: 6673kB
Worker 1: Batches: 1 Memory Usage: 6673kB
-> Parallel Hash Join (cost=19772.67..303865.71 rows=4166747 width=8) (actual time=2694.604..3822.223 rows=3333333 loop
s=3)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Hash Join (cost=2374.89..219877.18 rows=4166747 width=8) (actual time=17.696..1863.307 rows=3333333 l
oops=3)
Hash Cond: (_age_default_alias_0.end_id = _age_default_alias_1.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184515.47 rows=4166747 width=16)
(actual time=0.043..505.487 rows=3333333 loops=3)
-> Parallel Hash (cost=1888.44..1888.44 rows=38916 width=8) (actual time=16.889..16.890 rows=31133 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 4736kB
-> Parallel Index Only Scan using idx_book_id on "Book" _age_default_alias_1 (cost=0.29..1888.44 rows
=38916 width=8) (actual time=0.068..7.123 rows=31133 loops=3)
Heap Fetches: 0
-> Parallel Hash (cost=12990.05..12990.05 rows=268619 width=8) (actual time=115.445..115.446 rows=214895 loops=3)
Buckets: 262144 (originally 262144) Batches: 16 (originally 8) Memory Usage: 12352kB
-> Parallel Index Only Scan using "User_id_idx" on "User" u (cost=0.42..12990.05 rows=268619 width=8) (actu
al time=0.056..51.793 rows=214895 loops=3)
Heap Fetches: 0
Planning Time: 1.453 ms
Execution Time: 5123.253 ms
(35 rows)
psql-17.5-5432-pgsql=#
@jrgemignani the results look amazing. Have you tried them with or without indexes?
@serdarmicrosoft Unfortunately, no. agefreighter adds in indexes and it doesn't look like indexes can be removed. Or, at least that is my understanding.
edit: I misspoke, indexes can be removed. I am removing them and testing the performance without.
@serdarmicrosoft I removed the gin indexes on properties for User, Book, Author, and Review. As there wasn't a field using them, there wasn't an impact.
@serdarmicrosoft I removed the btree indexes on id for User, Book, Author, and Review. There wasn't any noticeable difference.
@serdarmicrosoft I removed the rest leaving -
psql-17.5-5432-pgsql=# select tablename,indexname,indexdef from pg_indexes where schemaname = 'goodreads_graph_x'; tablename | indexname | indexdef
------------------+-----------------------+--------------------------------------------------------------------------------------------------
_ag_label_vertex | _ag_label_vertex_pkey | CREATE UNIQUE INDEX _ag_label_vertex_pkey ON goodreads_graph_x._ag_label_vertex USING btree (id)
_ag_label_edge | _ag_label_edge_pkey | CREATE UNIQUE INDEX _ag_label_edge_pkey ON goodreads_graph_x._ag_label_edge USING btree (id)
(2 rows)
SQL
psql-17.5-5432-pgsql=# explain analyze select count(*), u.id from goodreads_graph_x."User" u, goodreads_graph_x."HAS_INTERACTION" h, goodreads_graph_x."Book" b where u.id = h.start_id and b.id = h.end_id GROUP BY u.id ORDER BY count(*) DESC LIMIT 10;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------
Limit (cost=838170.89..838170.92 rows=10 width=16) (actual time=4981.262..5005.469 rows=10 loops=1)
-> Sort (cost=838170.89..839782.61 rows=644686 width=16) (actual time=4981.260..5005.466 rows=10 loops=1)
Sort Key: (count(*)) DESC
Sort Method: top-N heapsort Memory: 25kB
-> Finalize GroupAggregate (cost=660908.50..824239.46 rows=644686 width=16) (actual time=4854.597..4977.144 rows=132661 loops=1)
Group Key: u.id
-> Gather Merge (cost=660908.50..811345.74 rows=1289372 width=16) (actual time=4854.587..4916.484 rows=135688 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=659908.48..661520.19 rows=644686 width=16) (actual time=4848.318..4855.496 rows=45229 loops=3)
Sort Key: u.id
Sort Method: quicksort Memory: 3022kB
Worker 0: Sort Method: quicksort Memory: 2916kB
Worker 1: Sort Method: quicksort Memory: 2912kB
-> Partial HashAggregate (cost=547684.39..586683.96 rows=644686 width=16) (actual time=4812.863..4827.311 rows=4522
9 loops=3)
Group Key: u.id
Planned Partitions: 8 Batches: 1 Memory Usage: 6929kB
Worker 0: Batches: 1 Memory Usage: 6673kB
Worker 1: Batches: 1 Memory Usage: 6673kB
-> Parallel Hash Join (cost=22063.08..313304.87 rows=4166747 width=8) (actual time=2693.933..3832.376 rows=33
33333 loops=3)
Hash Cond: (h.start_id = u.id)
-> Parallel Hash Join (cost=3853.15..228504.20 rows=4166747 width=8) (actual time=18.426..1859.249 rows
=3333333 loops=3)
Hash Cond: (h.end_id = b.id)
-> Parallel Seq Scan on "HAS_INTERACTION" h (cost=0.00..184515.47 rows=4166747 width=16) (actual
time=0.034..500.278 rows=3333333 loops=3)
-> Parallel Hash (cost=3166.40..3166.40 rows=54940 width=8) (actual time=17.692..17.693 rows=3113
3 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 4736kB
-> Parallel Seq Scan on "Book" b (cost=0.00..3166.40 rows=54940 width=8) (actual time=0.009
..8.255 rows=31133 loops=3)
-> Parallel Hash (cost=13802.19..13802.19 rows=268619 width=8) (actual time=118.693..118.694 rows=21489
5 loops=3)
Buckets: 262144 (originally 262144) Batches: 16 (originally 8) Memory Usage: 12384kB
-> Parallel Seq Scan on "User" u (cost=0.00..13802.19 rows=268619 width=8) (actual time=0.049..55
.689 rows=214895 loops=3)
Planning Time: 0.411 ms
Execution Time: 5007.980 ms
(32 rows)
psql-17.5-5432-pgsql=#
u_idc
psql-17.5-5432-pgsql=# SELECT * FROM cypher('goodreads_graph_x', $$ explain analyze MATCH (u:User)-[:HAS_INTERACTION]->(:Book) RETURN count(*), u_idc ORDER BY count(*) DESC LIMIT 10 $$) AS (count agtype, id agtype);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------
Subquery Scan on _ (cost=839782.61..839782.66 rows=10 width=64) (actual time=5105.792..5129.789 rows=10 loops=1)
-> Limit (cost=839782.61..839782.63 rows=10 width=40) (actual time=5105.788..5129.780 rows=10 loops=1)
-> Sort (cost=839782.61..841394.32 rows=644686 width=40) (actual time=5105.786..5129.777 rows=10 loops=1)
Sort Key: ((count(*))::agtype) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Finalize GroupAggregate (cost=660908.50..825851.17 rows=644686 width=40) (actual time=4855.134..5041.501 rows=132661 loops=1
)
Group Key: u.id
-> Gather Merge (cost=660908.50..811345.74 rows=1289372 width=16) (actual time=4855.120..4919.445 rows=134560 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=659908.48..661520.19 rows=644686 width=16) (actual time=4848.028..4856.092 rows=44853 loops=3)
Sort Key: u.id
Sort Method: quicksort Memory: 2922kB
Worker 0: Sort Method: quicksort Memory: 2907kB
Worker 1: Sort Method: quicksort Memory: 2986kB
-> Partial HashAggregate (cost=547684.39..586683.96 rows=644686 width=16) (actual time=4813.103..4827.149 row
s=44853 loops=3)
Group Key: u.id
Planned Partitions: 8 Batches: 1 Memory Usage: 6673kB
Worker 0: Batches: 1 Memory Usage: 6673kB
Worker 1: Batches: 1 Memory Usage: 6929kB
-> Parallel Hash Join (cost=22063.08..313304.87 rows=4166747 width=8) (actual time=2698.029..3826.705 r
ows=3333333 loops=3)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Hash Join (cost=3853.15..228504.20 rows=4166747 width=8) (actual time=18.387..1856.38
1 rows=3333333 loops=3)
Hash Cond: (_age_default_alias_0.end_id = _age_default_alias_1.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184515.47 rows=4
166747 width=16) (actual time=0.034..501.547 rows=3333333 loops=3)
-> Parallel Hash (cost=3166.40..3166.40 rows=54940 width=8) (actual time=17.673..17.674 row
s=31133 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 4704kB
-> Parallel Seq Scan on "Book" _age_default_alias_1 (cost=0.00..3166.40 rows=54940 wi
dth=8) (actual time=0.018..8.248 rows=31133 loops=3)
-> Parallel Hash (cost=13802.19..13802.19 rows=268619 width=8) (actual time=127.643..127.644 rows
=214895 loops=3)
Buckets: 262144 (originally 262144) Batches: 16 (originally 8) Memory Usage: 12352kB
-> Parallel Seq Scan on "User" u (cost=0.00..13802.19 rows=268619 width=8) (actual time=0.0
42..56.620 rows=214895 loops=3)
Planning Time: 0.481 ms
Execution Time: 5132.179 ms
(33 rows)
psql-17.5-5432-pgsql=#
id(u)
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------
Limit (cost=1016207.67..1016207.69 rows=10 width=64) (actual time=17369.620..17386.003 rows=10 loops=1)
-> Sort (cost=1016207.67..1017819.38 rows=644686 width=64) (actual time=17369.618..17386.000 rows=10 loops=1)
Sort Key: ((count(*))::agtype) DESC
Sort Method: top-N heapsort Memory: 26kB
-> Finalize GroupAggregate (cost=832498.42..1002276.23 rows=644686 width=64) (actual time=16953.437..17306.058 rows=132661 loops=1)
Group Key: (age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties)))
-> Gather Merge (cost=832498.42..982935.65 rows=1289372 width=40) (actual time=16953.419..17099.214 rows=132995 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=831498.39..833110.11 rows=644686 width=40) (actual time=16943.086..16950.235 rows=44332 loops=3)
Sort Key: (age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties)))
Sort Method: quicksort Memory: 3384kB
Worker 0: Sort Method: quicksort Memory: 3440kB
Worker 1: Sort Method: quicksort Memory: 3507kB
-> Partial HashAggregate (cost=683413.13..751662.38 rows=644686 width=40) (actual time=16656.683..16671.162 rows=44
332 loops=3)
Group Key: age_id(_agtype_build_vertex(u.id, _label_name('237668'::oid, u.id), u.properties))
Planned Partitions: 16 Batches: 1 Memory Usage: 6417kB
Worker 0: Batches: 1 Memory Usage: 6417kB
Worker 1: Batches: 1 Memory Usage: 6673kB
-> Parallel Hash Join (cost=25473.08..351375.47 rows=4166747 width=32) (actual time=2687.049..13682.043 rows=
3333333 loops=3)
Hash Cond: (_age_default_alias_0.start_id = u.id)
-> Parallel Hash Join (cost=3853.15..228504.20 rows=4166747 width=8) (actual time=18.775..1868.630 rows
=3333333 loops=3)
Hash Cond: (_age_default_alias_0.end_id = _age_default_alias_1.id)
-> Parallel Seq Scan on "HAS_INTERACTION" _age_default_alias_0 (cost=0.00..184515.47 rows=4166747
width=16) (actual time=0.032..504.195 rows=3333333 loops=3)
-> Parallel Hash (cost=3166.40..3166.40 rows=54940 width=8) (actual time=18.347..18.348 rows=3113
3 loops=3)
Buckets: 131072 Batches: 1 Memory Usage: 4704kB
-> Parallel Seq Scan on "Book" _age_default_alias_1 (cost=0.00..3166.40 rows=54940 width=8)
(actual time=0.010..8.536 rows=31133 loops=3)
-> Parallel Hash (cost=13802.19..13802.19 rows=268619 width=110) (actual time=111.960..111.961 rows=214
895 loops=3)
Buckets: 65536 (originally 65536) Batches: 32 (originally 16) Memory Usage: 9824kB
-> Parallel Seq Scan on "User" u (cost=0.00..13802.19 rows=268619 width=110) (actual time=0.045..
38.799 rows=214895 loops=3)
Planning Time: 0.448 ms
Execution Time: 17388.427 ms
(32 rows)
psql-17.5-5432-pgsql=#
Not much difference. This could be because the data was loaded in an orderly way making indexes unnecessary?
@serdarmicrosoft After doing some research, I'm not sure we can improve this particular query any further with indexes. As it is using hash joins -
In PostgreSQL, a hash join typically ignores indexes because it is designed to use a different strategy for joining tables. Here's why this happens and how it works:
Why Hash Joins Ignore Indexes Hash Table Construction:
A hash join creates a hash table in memory for the smaller table (or the inner table in the join). This hash table maps the join key to the corresponding rows. The larger table (outer table) is then scanned, and its rows are matched against the hash table.
I think the only target left would be the aggregation functions.