
Memory leak

Open vladiksun opened this issue 1 year ago • 10 comments

Describe the bug

Potential memory leak

How are you accessing AGE (Command line, driver, etc.)?

  • JDBC

What data setup do we need to do?


select ag_catalog.create_graph('test_graph');

--- create test labels ---
select ag_catalog.create_vlabel('test_graph','domain');
select ag_catalog.create_vlabel('test_graph','user');

--- create test relation ---
select ag_catalog.create_elabel('test_graph','edges');

--- generate test data, should take about 2 minutes ---
DO
$do$
    BEGIN
        FOR i IN 1..8000 LOOP
                EXECUTE format('
                select * from ag_catalog.cypher(''test_graph'',
                $$
                    CREATE (Org_level_1_%s: domain {id:"Org_level_1_%s", type:"Organization", name: "Org_level_1_%s some real name that could looooooooooooooong" ,hidden:false })
                    CREATE (Org_level_2_%s: domain {id:"Org_level_2_%s", type:"Organization", name: "Org_level_2_%s some real name that could looooooooooooooong" ,hidden:false })
                    CREATE (Org_level_3_%s: domain {id:"Org_level_3_%s", type:"Organization", name: "Org_level_3_%s some real name that could looooooooooooooong" ,hidden:false })
                    CREATE (Org_level_4_%s: domain {id:"Org_level_4_%s", type:"Organization", name: "Org_level_4_%s some real name that could looooooooooooooong" ,hidden:false })
                    CREATE (Dep_level_5_%s: domain {id:"Dep_level_5_%s", type:"Department", name: "Dep_level_5_%s some real name that could looooooooooooooong" ,hidden:false })

                    CREATE (User_%s: user { id:"User_%s", type:"User", name: "user_%s" ,hidden:false })

                    CREATE (User_%s)-[_rel_user:edges { from: ''User_%s'', to: ''Dep_level_5_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''user'', relation: ''user-domain'' }]->(Dep_level_5_%s)

                    CREATE (Dep_level_5_%s)-[_rel1:edges { from: ''Dep_level_5_%s'', to: ''Org_level_4_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_4_%s)
                    CREATE (Org_level_4_%s)-[_rel2:edges { from: ''Org_level_4_%s'', to: ''Org_level_3_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_3_%s)
                    CREATE (Org_level_3_%s)-[_rel3:edges { from: ''Org_level_3_%s'', to: ''Org_level_2_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_2_%s)
                    CREATE (Org_level_2_%s)-[_rel4:edges { from: ''Org_level_2_%s'', to: ''Org_level_1_%s'', hidden: false, toCollection: ''domain'', fromCollection: ''domain'', relation: ''parent'' }]->(Org_level_1_%s)
                $$
            ) as (any_vertex ag_catalog.agtype)
            ', VARIADIC ARRAY(SELECT array_fill('iter_' || i::text, '{100}')) );
            END LOOP;
    END
$do$;

--- create test label GIN indexes ---
CREATE INDEX domain_gin_idx ON test_graph.domain USING GIN (properties);
CREATE INDEX user_gin_idx ON test_graph.user USING GIN (properties);

CREATE INDEX edges_gin_idx ON test_graph.edges USING GIN (properties);

What is the command that caused the error?


-- run all three scripts below repeatedly and observe DB memory usage

-- create some data
select * from ag_catalog.cypher('test_graph', $$
    CREATE (d: domain { id: "test1", type: "Organization", name: "test1" , hidden: false } )
    RETURN d $$
) as (vertex agtype);

-- delete some data
select * from ag_catalog.cypher('test_graph',
                                $$
    MATCH (any_vertex: domain { id: "test1" } )
    DELETE any_vertex
    RETURN any_vertex
$$) as (catalog ag_catalog.agtype);

-- run query
select
    domain
from
    ag_catalog.cypher('test_graph',$$

        MATCH (user_vertex: user { hidden: false, id: 'User_iter_100'})
            -[e1:edges  {hidden: false, fromCollection: 'user', toCollection: 'domain', relation: 'user-domain'}]->(to1: domain {hidden: false})
            -[e2:edges*0..10 {hidden: false, fromCollection: 'domain', toCollection: 'domain', relation: 'parent'}]->(to2: domain {hidden: false})

        RETURN to2
$$) as domain(domain ag_catalog.agtype);
The database went into recovery mode, which shows up as an I/O error on the client side.

Expected behavior

No server crash occurs.

Environment (please complete the following information):

  • Postgres 14 + Apache AGE 1.5.0

Additional context

It looks like any Cypher script's result is cached after the first run. The second and all subsequent runs are faster unless data has been created, updated, or deleted; if a create/update/delete happens, the script executes slower again. This leads to increased DB memory consumption that is never released.

For our real data (about 20 vertex labels with varying connectedness over 3 million edges), a script similar to the one below takes around 5 seconds the first time and around 300 ms on subsequent runs, until a create/update/delete happens.

select
    domain
from
    ag_catalog.cypher('test_graph',$$

        MATCH (user_vertex: user { hidden: false, id: 'User_iter_100'})
            -[e1:edges  {hidden: false, fromCollection: 'user', toCollection: 'domain', relation: 'user-domain'}]->(to1: domain {hidden: false})
            -[e2:edges*0..10 {hidden: false, fromCollection: 'domain', toCollection: 'domain', relation: 'parent'}]->(to2: domain {hidden: false})

        RETURN to2
$$) as domain(domain ag_catalog.agtype);

We used a Docker container with a maximum of 512 MB of RAM available to reproduce this behaviour with the provided test case.

Please take a look at the attached screenshot. (Screenshot: memory_leak)

vladiksun avatar Aug 08 '24 10:08 vladiksun

@vladiksun Yes, the VLE component does cache data for multiple contexts (graphs). It will only free them when one of the following happens:

  1. The transaction id changes (updates).
  2. The maximum number of cached contexts is reached and a new one is created; the least recently used one is removed (an LRU cache, if I recall).
  3. The delete_global_graphs function is called to purge unused graphs.
  4. The session exits.

But it shouldn't crash. I will take a look.
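The eviction rules above can be sketched as a small model. This is a deliberately simplified, hypothetical illustration in Python, not AGE's actual C implementation; the `VLECache` class, its method names, and the capacity value are all invented:

```python
from collections import OrderedDict

class VLECache:
    """Toy model of AGE's per-session VLE graph-context cache.

    Contexts are freed when: (1) the transaction id changes,
    (2) the LRU capacity is exceeded, (3) an explicit purge is
    requested (delete_global_graphs), or (4) the session ends.
    """

    def __init__(self, capacity=3):
        self.capacity = capacity
        self.txid = 0
        self.contexts = OrderedDict()  # graph name -> cached edge data

    def get_context(self, graph, txid, load):
        # Rule 1: a new transaction id invalidates all cached contexts.
        if txid != self.txid:
            self.contexts.clear()
            self.txid = txid
        if graph in self.contexts:
            self.contexts.move_to_end(graph)   # mark as recently used
            return self.contexts[graph]
        # Rule 2: evict the least recently used context at capacity.
        if len(self.contexts) >= self.capacity:
            self.contexts.popitem(last=False)
        self.contexts[graph] = load(graph)     # expensive first load
        return self.contexts[graph]

    def delete_global_graphs(self):
        # Rule 3: explicit purge of all cached graph contexts.
        self.contexts.clear()

cache = VLECache(capacity=2)
load = lambda g: f"edges-of-{g}"
cache.get_context("g1", txid=1, load=load)
cache.get_context("g2", txid=1, load=load)
cache.get_context("g3", txid=1, load=load)   # evicts g1 (LRU)
print(sorted(cache.contexts))                # ['g2', 'g3']
cache.get_context("g2", txid=2, load=load)   # new txid flushes everything
print(sorted(cache.contexts))                # ['g2']
```

This also illustrates why the slowdown reappears after every create/update/delete: a write bumps the transaction id, flushing the cache, so the next traversal pays the full load cost again.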

jrgemignani avatar Aug 08 '24 16:08 jrgemignani

@vladiksun I have not been able to reproduce a crash, but will keep trying. However, I am using the latest master branch (PG16), and there have been fixes to the VLE caching mechanism since version 1.5.0 to address crashes. If you would like to test against the latest, unsupported image, try dev_snapshot_&lt;version&gt;. For you, that would be dev_snapshot_PG14.

jrgemignani avatar Aug 08 '24 18:08 jrgemignani

@vladiksun I do see the memory leak you are talking about and have narrowed it down to a few commands.

jrgemignani avatar Aug 08 '24 20:08 jrgemignani

@vladiksun PR #2028 should address most of this memory leak, although there may be more, subtler leaks.

jrgemignani avatar Aug 09 '24 00:08 jrgemignani

@jrgemignani thanks for taking a look. We'll wait until it's ready then.

vladiksun avatar Aug 09 '24 08:08 vladiksun

@vladiksun It is in the master branch right now. I'll be making PRs for the other branches today.

jrgemignani avatar Aug 09 '24 14:08 jrgemignani

@vladiksun This fix is now in all branches. If you want to test the latest for the branch you are using, try dev_snapshot_&lt;version&gt;. For example, dev_snapshot_PG14.

jrgemignani avatar Aug 09 '24 17:08 jrgemignani

@jrgemignani I checked the fix. It works within one session. As I understand it, more sessions mean more memory consumed to keep their caches. For example, using 100 connections via pg_bouncer means we would run out of memory pretty soon. Is that right? In our case we have approx. 40 thousand vertices with hierarchical relations, so a typical hierarchical query takes about 5 seconds before the cache kicks in, which is too long anyway.

I am curious why delete_global_graphs does not clear the entire cache. (Screenshot: cypher_memory_consumption)

vladiksun avatar Aug 19 '24 14:08 vladiksun

@vladiksun There are some other memory leaks, for example #2046 which I am currently working on, that could be contributing factors. That particular one highlights quite a few other areas that need corrective action as well. This week I will be focusing on getting patches out to deal with those.

I am hesitant to suggest whether this will or will not fix memory issues, due to the complexity of PostgreSQL's memory system (contexts and caches). However, I will note that in my debugging of these memory leaks, I have found that PostgreSQL doesn't like to give memory back even after it is freed. It seems that once it gets memory from the system, it may hold onto it for a while, just in case.

Also, keep in mind that PostgreSQL itself will cache items if it feels caching will improve performance.
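The allocator behavior described above can be modeled with a toy free-list sketch. This is a hypothetical illustration, not PostgreSQL's actual MemoryContext code; the `BlockPool` class and its fields are invented:

```python
class BlockPool:
    """Toy model of why a process's resident memory stays high: freed
    blocks go back to a process-local free list for reuse rather than
    being returned to the operating system."""

    def __init__(self):
        self.from_os = 0        # bytes ever requested from the OS
        self.free_list = []     # freed block sizes kept for reuse

    def alloc(self, size):
        if size in self.free_list:
            self.free_list.remove(size)   # reuse a retained block
            return size
        self.from_os += size              # otherwise grow the process
        return size

    def free(self, block):
        self.free_list.append(block)      # retained, not given back

pool = BlockPool()
a = pool.alloc(1024)
pool.free(a)          # memory is "freed" ...
pool.alloc(1024)      # ... and reused without asking the OS again
print(pool.from_os)   # 1024: resident size never shrank
```

Under this model, fixing a leak stops growth but does not shrink the numbers an external monitor (e.g. a container memory gauge) reports, which matches the plateaus visible in the screenshots.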

jrgemignani avatar Aug 19 '24 15:08 jrgemignani

This issue is stale because it has been open 60 days with no activity. Remove the "Abondoned" label or comment, or this will be closed in 14 days.

github-actions[bot] avatar Oct 19 '24 00:10 github-actions[bot]

This issue was closed because it had been stalled for a further 14 days with no activity.

github-actions[bot] avatar Nov 02 '24 00:11 github-actions[bot]