
Unused Function causing Memory Leak?

Open markhalonen opened this issue 3 years ago • 23 comments

PGSync version: master

Postgres version: 13.7

Elasticsearch version: 8.3.2

Redis version: 7

Python version: 3.8.10

Problem Description:

We have a memory leak we're trying to track down.

Can I ask what the utility of this function is: https://github.com/toluaina/pgsync/blob/1489a24815454800a230355fafadf77c9ac7442f/pgsync/base.py#L402

It seems to only issue a select statement and return the results, but the callers don't use the results: https://github.com/toluaina/pgsync/blob/1489a24815454800a230355fafadf77c9ac7442f/pgsync/sync.py#L441

https://github.com/toluaina/pgsync/blob/1489a24815454800a230355fafadf77c9ac7442f/pgsync/sync.py#L1242

If I just replace the body of this function with return 1, I don't get any errors and it appears to at least improve the memory leak.

Potentially related? https://github.com/sqlalchemy/sqlalchemy/issues/7875

@zacherkkila @bhalonen

markhalonen avatar Jul 26 '22 04:07 markhalonen

Hi @markhalonen

  • You are right, logical_slot_get_changes does not use the results.
  • I will change that now
  • We are calling something like SELECT * FROM PG_LOGICAL_SLOT_GET_CHANGES('testdb', NULL, NULL), but we don't care about the results because we are just trying to clear those processed rows from the logical decoding logs (a minimal sketch of this pattern follows after this list).
  • This is further described here
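
A minimal sketch of that pattern, assuming SQLAlchemy; the connection URL and slot name are placeholders for illustration, not pgsync's actual values:

import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@localhost/testdb")

with engine.begin() as conn:
    # pg_logical_slot_get_changes() consumes the decoded changes it returns,
    # advancing the slot (pg_logical_slot_peek_changes() would leave them in
    # place). The rows themselves are never read here.
    conn.execute(
        sa.text("SELECT * FROM pg_logical_slot_get_changes(:slot, NULL, NULL)"),
        {"slot": "testdb"},
    )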

toluaina avatar Jul 26 '22 15:07 toluaina

Ok, will be interested to see. Maybe instead of select * we could try select txid or something?

markhalonen avatar Jul 26 '22 15:07 markhalonen

We could simply replace this line with self.execute(statement) and that should have the same effect. And yes, we can also limit it to a single column such as txid.
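
For illustration, a hedged sketch of the shape being discussed; SlotConsumer and its execute wrapper are hypothetical stand-ins, not pgsync's actual classes:

import sqlalchemy as sa


class SlotConsumer:
    """Hypothetical example class; not the pgsync implementation."""

    def __init__(self, engine: sa.engine.Engine):
        self.engine = engine

    def execute(self, statement, params=None) -> None:
        # Run the statement for its side effect only and discard the cursor.
        with self.engine.begin() as conn:
            conn.execute(statement, params or {})

    def logical_slot_get_changes(self, slot: str) -> None:
        # The function returns (lsn, xid, data); selecting only xid keeps the
        # result set small, and nothing is fetched back into Python.
        statement = sa.text(
            "SELECT xid FROM pg_logical_slot_get_changes(:slot, NULL, NULL)"
        )
        self.execute(statement, {"slot": slot})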

toluaina avatar Jul 26 '22 15:07 toluaina

Ok, so I pushed some changes in commit cec8e13. Can you let me know if that has the same effect on memory usage?

toluaina avatar Jul 26 '22 16:07 toluaina

There is no effect from this change on memory usage.

bhalonen avatar Jul 27 '22 13:07 bhalonen

I found the cause of this. The Node class is created multiple times and needs to implement a singleton pattern. I'll have a fix for this soon
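
Illustrative only: one common way to avoid re-creating an object for the same inputs is to memoize instances on a key, along the lines of the sketch below. This is a generic singleton/caching sketch, not the actual pgsync fix.

class Node:
    # Cache of already-created instances, keyed by (table, schema).
    _instances: dict = {}

    def __new__(cls, table: str, schema: str):
        key = (table, schema)
        if key not in cls._instances:
            cls._instances[key] = super().__new__(cls)
        return cls._instances[key]

    def __init__(self, table: str, schema: str):
        # Note: __init__ still runs on every call, so keep it idempotent.
        self.table = table
        self.schema = schema


assert Node("part_number", "steelhead") is Node("part_number", "steelhead")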

toluaina avatar Jul 28 '22 08:07 toluaina

This has been addressed in the latest commit to master. I'm still doing some testing, but please try it out and let me know.

toluaina avatar Jul 28 '22 14:07 toluaina

@toluaina we are still seeing memory leaking with the latest commit. We have a somewhat larger nested schema, so query size or query caching could be the source of the leak.

bhalonen avatar Jul 28 '22 18:07 bhalonen

@toluaina we did not see leaking with a flat structure, and we've also seen slow leaks on an inactive db, building up over 24 hours. Not sure if that helps.

bhalonen avatar Jul 28 '22 19:07 bhalonen

This helps a lot, thanks.

toluaina avatar Jul 29 '22 12:07 toluaina

@toluaina it is possible it was still leaking with a flat structure, but certainly much much slower.

bhalonen avatar Jul 29 '22 13:07 bhalonen

@toluaina https://github.com/bhalonen/pgsync-mwe-leak: this example absolutely leaks memory slowly over time. The spam.sh loading causes memory to ramp, but I'm not sure whether it leaves uncollected memory. It isn't as high-quality a reproduction as imaginable, but I did see steady-state memory usage climb 20% over several hours.

bhalonen avatar Jul 30 '22 03:07 bhalonen

  • I spent some time trying to reproduce this.
  • Memory does increase but peaks and remains steady in my case studies.
  • I noticed your Dockerfile is installing pgsync from PyPI, meaning you were not using the changes on the master branch.
  • In any case, I have just published 2.3.2 with the latest changes on master.
  • I also did some profiling and was not able to track any resource leak.
  • I am going to try memray to see if it highlights any difference from the tools I've been using (a minimal usage sketch follows after this list).
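
A minimal sketch of wrapping a code path with memray's tracker; the workload function and output file name are placeholders, not anything from pgsync:

from memray import Tracker


def workload() -> int:
    # Stand-in for the code path being profiled (e.g. one sync iteration).
    data = [bytes(1024) for _ in range(10_000)]
    return len(data)


if __name__ == "__main__":
    # Capture an allocation profile for everything inside the block.
    with Tracker("memray-profile.bin"):
        workload()
    # Afterwards, "memray flamegraph memray-profile.bin" renders a flamegraph
    # of the captured allocations.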

toluaina avatar Jul 31 '22 22:07 toluaina

I have also run it with master and still had memory leaks (mounted file system, cloned, ran make install). I think I can get a higher-quality reproduction that isn't so slow... I'll also verify it isn't fixed on this release.

bhalonen avatar Jul 31 '22 22:07 bhalonen

You did have some more changes besides the one I tested.

bhalonen avatar Jul 31 '22 22:07 bhalonen

Correct, I had some other minor changes which should not have much impact.

toluaina avatar Aug 01 '22 15:08 toluaina

@toluaina After testing today, we still have a leak. All my attempts to attach memray resulted in segmentation faults, with no interesting trace result in gdb. However, I was playing around with async mode and noticed it does not leak memory, but it does not actually update Elastic either.

These are example logs for async mode under a write load that sync mode a) updates search results for, and b) leaks memory on. Async does neither... Not sure if this helps. Will try to make the example leak faster.

Async steelhead Xlog brent ver: [0] => Db: [0] => Redis: [total = 0 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [4] => Redis: [total = 4 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [4] => Redis: [total = 4 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [73] => Redis: [total = 72 pending = 1] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [542] => Redis: [total = 542 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [1,081] => Redis: [total = 1,081 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [1,594] => Redis: [total = 1,586 pending = 8] => Elastic: [0] ...

bhalonen avatar Aug 01 '22 20:08 bhalonen

I am still doing some in-depth analysis to assess the cause of any leaks.

That's a little strange. I just pulled the latest version and all changes were propagated to ES. Is Elasticsearch completely empty? Can you try deleting the state/checkpoint file and re-running? This should be the hidden file in the current directory with the name of the database and index.
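
A hedged sketch of clearing those checkpoint files before a clean re-run, based on the naming described above (a hidden file in the working directory that starts with the database name, "steelhead" in this thread); the exact glob pattern is an assumption:

import glob
import os

# Assumed naming: hidden ".<database>_<index>"-style files in the current
# working directory, per the description above.
for path in glob.glob(".steelhead*"):
    if os.path.isfile(path):
        print(f"removing checkpoint file {path}")
        os.remove(path)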

toluaina avatar Aug 01 '22 20:08 toluaina

Actually, I can see the Elastic document count is still at 0. That value should be increasing with each CRUD operation on the database.

toluaina avatar Aug 01 '22 20:08 toluaina

Found a way to drive memory leaks: an incorrect specification in the schema file, where a "one_to_many" relationship is improperly marked as "one_to_one". I found this through random stress testing. This makes the leak go at warp speed, but I do not believe that is currently our problem.

I am currently wondering if "many_to_one" is a problem as well. For example, on a GitHub issue a comment belongs to one issue (an example of "one_to_many"), but one commenter can belong to many issues. What is that classified as in pgsync? Is this relationship "legal"? We do many searches like that, and it could explain our leak.

On your questions: ES was not totally empty, as I ran a single-pass sync to get some results in. It also didn't put any results in if run without the prior sync. Our typical start-up procedure deletes all of the checkpoint files and all Elastic data for a clean start-up during dev.

This is an example of our most boring possible search index:

{
    "database": "steelhead",
    "index": "steelhead_search_part_number",
    "setting": {
      "analysis": {
        "analyzer": {
          "ngram_analyzer": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "ngram_tokenizer"
          }
        },
        "tokenizer": {
          "ngram_tokenizer": {
            "token_chars": [
              "letter",
              "digit",
              "punctuation",
              "symbol"
            ],
            "type": "ngram",
            "min_gram": "2",
            "max_gram": "10"
          }
        }
      },
      "max_ngram_diff": "8"
    },
    "plugins": [
      "IdInDomain"
    ],
    "nodes": {
      "table": "part_number",
      "schema": "steelhead",
      "columns": [
        "name",
        "created_at",
        "description_markdown"
      ],
      "children": [
        {
          "table": "user",
          "schema": "steelhead",
          "columns": [
            "name",
            "id",
            "created_at"
          ],
          "relationship": {
            "variant": "object",
            "type": "one_to_one"
          },
          "children": [
            {
              "table": "domain",
              "schema": "steelhead",
              "columns": [
                "id"
              ],
              "relationship": {
                "variant": "object",
                "type": "one_to_one"
              }
            }
          ]
        }
      ]
    }
  },
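
As an aside, a hedged sketch of a check one could run over a schema file like the one above to flag relationship types outside pgsync's documented set (one_to_one and one_to_many, with object or scalar variants); the file name and the strictness of the check are assumptions:

import json

ALLOWED_TYPES = {"one_to_one", "one_to_many"}
ALLOWED_VARIANTS = {"object", "scalar"}


def check_node(node: dict, path: str = "") -> None:
    here = f"{path}/{node.get('table', '?')}"
    rel = node.get("relationship")
    if rel is not None:
        if rel.get("type") not in ALLOWED_TYPES:
            print(f"{here}: unexpected relationship type {rel.get('type')!r}")
        if rel.get("variant") not in ALLOWED_VARIANTS:
            print(f"{here}: unexpected variant {rel.get('variant')!r}")
    for child in node.get("children", []):
        check_node(child, here)


# Assumed file name; the schema file is a JSON list of entries like the above.
with open("schema.json") as fh:
    for entry in json.load(fh):
        check_node(entry["nodes"])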

bhalonen avatar Aug 02 '22 02:08 bhalonen

I created a working branch to study this in more detail. I also have a script to insert random records. So far memory peaks at about 640MB. I think the relationship you described would effectively be a many-to-many, and many-to-many relationships are not supported. I'll try and play around with the relationship types to see the effect.
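
For reference, a hedged sketch of a stress script that inserts random rows to drive load; the table and column names are taken from the schema above, the connection URL is a placeholder, and this is not the actual script referenced here:

import random
import string

import sqlalchemy as sa

engine = sa.create_engine("postgresql://user:password@localhost/steelhead")


def random_text(length: int = 12) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=length))


with engine.begin() as conn:
    for _ in range(1_000):
        # Columns other than these are assumed to have defaults (e.g. created_at).
        conn.execute(
            sa.text(
                "INSERT INTO steelhead.part_number (name, description_markdown) "
                "VALUES (:name, :description)"
            ),
            {"name": random_text(), "description": random_text(40)},
        )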

toluaina avatar Aug 03 '22 10:08 toluaina

Apologies for the delay. This has required a bit of effort. I have identified the major cause of this: SQLAlchemy stores the SQL compilation cache on the engine (see "SQL Compilation Caching" in the SQLAlchemy docs). This is intended as a performance optimization when subsequent queries are run. I am now clearing this cache on each iteration. This has all been merged into the main branch.
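
For context, a hedged sketch of two documented SQLAlchemy 1.4+ knobs that bound or bypass this cache; this illustrates the mechanism described above, not the exact change made in pgsync:

import sqlalchemy as sa

# 1) Bound the per-engine LRU statement cache at creation time
#    (query_cache_size defaults to 500 entries).
engine = sa.create_engine(
    "postgresql://user:password@localhost/testdb",
    query_cache_size=100,
)

# 2) Bypass the compiled cache for a specific connection by setting the
#    "compiled_cache" execution option to None.
with engine.connect() as conn:
    uncached = conn.execution_options(compiled_cache=None)
    uncached.execute(sa.text("SELECT 1"))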

toluaina avatar Aug 25 '22 19:08 toluaina