Memory leak
Describe the bug
Potential memory leak
How are you accessing AGE (Command line, driver, etc.)?
- JDBC
What data setup do we need to do?
select ag_catalog.create_graph('test_graph');
--- create test labels ---
select ag_catalog.create_vlabel('test_graph','domain');
select ag_catalog.create_vlabel('test_graph','user');
--- create test relation ---
select ag_catalog.create_elabel('test_graph','edges');
--- generate test data, should take about 2 minutes ---
DO
$do$
BEGIN
FOR i IN 1..8000 LOOP
EXECUTE format('
select * from ag_catalog.cypher(''test_graph'',
$$
CREATE (Org_level_1_%s: domain {id:"Org_level_1_%s", type:"Organization", name: "Org_level_1_%s some real name that could looooooooooooooong" ,hidden:false })
CREATE (Org_level_2_%s: domain {id:"Org_level_2_%s", type:"Organization", name: "Org_level_2_%s some real name that could looooooooooooooong" ,hidden:false })
CREATE (Org_level_3_%s: domain {id:"Org_level_3_%s", type:"Organization", name: "Org_level_3_%s some real name that could looooooooooooooong" ,hidden:false })
CREATE (Org_level_4_%s: domain {id:"Org_level_4_%s", type:"Organization", name: "Org_level_4_%s some real name that could looooooooooooooong" ,hidden:false })
CREATE (Dep_level_5_%s: domain {id:"Dep_level_5_%s", type:"Department", name: "Dep_level_5_%s some real name that could looooooooooooooong" ,hidden:false })
CREATE (User_%s: user { id:"User_%s", type:"User", name: "user_%s" ,hidden:false })
CREATE (User_%s)-[_rel_user:edges { from: ''User_%s'', to: ''Dep_level_5_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''user'', relation: ''user-domain'' }]->(Dep_level_5_%s)
CREATE (Dep_level_5_%s)-[_rel1:edges { from: ''Dep_level_5_%s'', to: ''Org_level_4_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_4_%s)
CREATE (Org_level_4_%s)-[_rel2:edges { from: ''Org_level_4_%s'', to: ''Org_level_3_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_3_%s)
CREATE (Org_level_3_%s)-[_rel3:edges { from: ''Org_level_3_%s'', to: ''Org_level_2_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_2_%s)
CREATE (Org_level_2_%s)-[_rel4:edges { from: ''Org_level_2_%s'', to: ''Org_level_1_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_1_%s)
$$
) as (any_vertex ag_catalog.agtype)
', VARIADIC ARRAY(SELECT array_fill('iter_' || i::text, '{100}')) );
END LOOP;
END
$do$;
--- create test label GIN index ---
CREATE INDEX domain_gin_idx ON test_graph.domain USING GIN (properties);
CREATE INDEX user_gin_idx ON test_graph.user USING GIN (properties);
CREATE INDEX edges_gin_idx ON test_graph.edges USING GIN (properties);
What is the command that caused the error?
-- run the three scripts below repeatedly and observe DB memory usage
-- create some data
select * from ag_catalog.cypher('test_graph', $$
CREATE (d: domain { id: "test1", type: "Organization", name: "test1" , hidden: false } )
RETURN d $$
) as (vertex agtype);
-- delete some data
select * from ag_catalog.cypher('test_graph',
$$
MATCH (any_vertex: domain { id: "test1" } )
DELETE any_vertex
RETURN any_vertex
$$) as (catalog ag_catalog.agtype);
-- run query
select
domain
from
ag_catalog.cypher('test_graph',$$
MATCH (user_vertex: user { hidden: false, id: 'User_iter_100'})
-[e1:edges {hidden: false, fromCollection: 'user', toCollection: 'domain', relation: 'user-domain'}]->(to1: domain {hidden: false})
-[e2:edges*0..10 {hidden: false, fromCollection: 'domain', toCollection: 'domain', relation: 'parent'}]->(to2: domain {hidden: false})
RETURN to2
$$) as domain(domain ag_catalog.agtype);
The database went into recovery mode, which appears as an I/O error on the client side.
Expected behavior
No server crash occurs.
Environment (please complete the following information):
- Postgres 14 + Apache AGE 1.5.0
Additional context
It looks like every Cypher script's result is cached after the first run. The second and all subsequent runs are faster unless any data has been created, updated, or deleted; after a create/update/delete, the script executes slowly again. This leads to increased DB memory consumption that is never released.
For our real data (about 20 vertex labels with varying connectedness over 3 million edges), a similar script such as the one below takes around 5 seconds the first time and around 300 ms the second time, until a create/update/delete happens.
select
domain
from
ag_catalog.cypher('test_graph',$$
MATCH (user_vertex: user { hidden: false, id: 'User_iter_100'})
-[e1:edges {hidden: false, fromCollection: 'user', toCollection: 'domain', relation: 'user-domain'}]->(to1: domain {hidden: false})
-[e2:edges*0..10 {hidden: false, fromCollection: 'domain', toCollection: 'domain', relation: 'parent'}]->(to2: domain {hidden: false})
RETURN to2
$$) as domain(domain ag_catalog.agtype);
We used a Docker container capped at 512 MB of RAM to reproduce this behaviour with the provided test case.
Please take a look at the attached screenshot.
@vladiksun Yes, the VLE component does cache data for multiple contexts (graphs). It will only free them when:
- the transaction id changes (updates);
- the maximum number of cached contexts is reached and a new one is created, so one is evicted (it is an LRU cache, if I recall correctly);
- the delete_global_graphs function is called to purge unused graphs;
- the session exits.
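For reference, purging the cache manually might look like the sketch below. This assumes the function is exposed in the catalog as ag_catalog.age_delete_global_graphs taking an agtype argument, with NULL meaning "all graphs"; check the exact name and signature in your installed AGE version before relying on it.

```sql
-- Sketch: purge cached global graph contexts (verify the signature
-- against your AGE version's catalog before use).
-- Passing an agtype NULL is assumed to purge every cached graph context:
SELECT * FROM ag_catalog.age_delete_global_graphs(NULL);

-- Assumed form for purging a single graph's cached context by name:
SELECT * FROM ag_catalog.age_delete_global_graphs('"test_graph"'::agtype);
```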
But, it shouldn't crash. I will take a look.
@vladiksun I have not been able to reproduce a crash, but will keep trying. However, I am using the latest master branch (PG16), and there have been fixes to the VLE caching mechanism since version 1.5.0 to address crashes. If you would like to test against the latest, unsupported image, try dev_snapshot_<version>. For you, that would be dev_snapshot_PG14.
@vladiksun I do see the memory leak you are talking about and have narrowed it down to a few commands.
@vladiksun PR #2028 should address most of this memory leak, although there might be more, subtler leaks.
@jrgemignani Thanks for taking a look. We will wait until it is ready, then.
@vladiksun It is in the master branch right now. I'll be making PRs for the other branches today.
@vladiksun This fix is now in all branches. If you want to test the latest for the branch you are using, try dev_snapshot_<version>. For example, dev_snapshot_PG14.
@jrgemignani I checked the fix. It works within one session. As I understand it, with more sessions there would be more memory consumed to keep the cache. For example, using 100 connections via PgBouncer means we would run out of memory pretty soon. Is that right? In our case we have approximately 40 thousand vertices with hierarchical relations, so a typical hierarchical query takes about 5 seconds before the cache warms up, which is too long anyway.
I am curious why delete_global_graphs does not clear all of the cache.
@vladiksun There are some other memory leaks, for example #2046, which I am currently working on, that could be contributing factors. That particular issue highlights quite a few other areas that need corrective action as well. This week I will focus on getting patches out to deal with those.
I am hesitant to suggest that this will, or will not, fix memory issues, due to the complexity of PostgreSQL's memory system (contexts and caches). However, I will note that in my debugging of these memory leaks, I have found that PostgreSQL doesn't like to give memory back to the OS even after it is freed. It seems that once it gets memory from the system, it may hold onto it for a while, just in case.
Also, keep in mind that PostgreSQL itself will cache items if it feels caching will improve performance.
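One way to see where a backend's memory is actually going is the pg_backend_memory_contexts system view added in PostgreSQL 14, which lists the current session's memory contexts. The query below is a generic sketch; the context names it returns depend entirely on the server and workload.

```sql
-- Show the ten largest memory contexts of the current backend (PostgreSQL 14+).
SELECT name, parent, total_bytes, used_bytes, free_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 10;
```

Running this before and after the repro queries can help distinguish memory held by AGE's caches from memory PostgreSQL itself has retained.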
This issue is stale because it has been open for 60 days with no activity. Remove the "Abondoned" label or comment, or this will be closed in 14 days.
This issue was closed because it has been stalled for a further 14 days with no activity.