                        Unused Function causing Memory Leak?
PGSync version: master
Postgres version: 13.7
Elasticsearch version: 8.3.2
Redis version: 7
Python version: 3.8.10
Problem Description:
We have a memory leak we're trying to track down.
Can I ask what the utility of this function is: https://github.com/toluaina/pgsync/blob/1489a24815454800a230355fafadf77c9ac7442f/pgsync/base.py#L402
It seems to only issue a select statement and return the results, but the callers don't use the results: https://github.com/toluaina/pgsync/blob/1489a24815454800a230355fafadf77c9ac7442f/pgsync/sync.py#L441
https://github.com/toluaina/pgsync/blob/1489a24815454800a230355fafadf77c9ac7442f/pgsync/sync.py#L1242
If I just replace the body of this function with return 1, I don't get any errors and it appears to at least reduce the memory leak.
Potentially related? https://github.com/sqlalchemy/sqlalchemy/issues/7875
@zacherkkila @bhalonen
Hi @markhalonen
- You are right, logical_slot_get_changes does not use the results.
- I will change that now.
- We are calling something like SELECT * FROM PG_LOGICAL_SLOT_GET_CHANGES('testdb', NULL, NULL) but we don't care about the results, because we are just trying to clear those processed rows from the logical decoding logs. This is further described here.
Ok, I'll be interested to see. Maybe instead of select * we could try select txid or something?
We could simply replace this line with self.execute(statement), and that should have the same effect.
And yes, we can also limit it to a single column such as txid.
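For illustration, a minimal sketch of executing the slot query purely for its side effect, assuming a SQLAlchemy engine; the helper name consume_slot_changes and the single xid column are placeholders, not pgsync's actual code:

import sqlalchemy as sa

def consume_slot_changes(engine: sa.engine.Engine, slot_name: str) -> None:
    # Run PG_LOGICAL_SLOT_GET_CHANGES only to advance the slot past the
    # already-processed rows; the returned rows are intentionally discarded.
    statement = sa.text(
        "SELECT xid FROM PG_LOGICAL_SLOT_GET_CHANGES(:slot, NULL, NULL)"
    )
    with engine.begin() as conn:
        conn.execute(statement, {"slot": slot_name})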
Ok, so I pushed some changes in commit cec8e13. Can you let me know if that has the same effect on memory usage?
There is no effect from this change on memory usage.
I found the cause of this. The Node class is created multiple times and needs to implement a singleton pattern. I'll have a fix for this soon.
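As a rough illustration of that idea (not pgsync's actual Node implementation, which carries more state), a per-key singleton could look like this, with (schema, table) assumed as the identity key:

class Node:
    # Reuse one instance per (schema, table) instead of re-creating it on
    # every pass, so repeated traversals do not keep accumulating objects.
    _instances: dict = {}

    def __new__(cls, table: str, schema: str):
        key = (schema, table)
        if key not in cls._instances:
            cls._instances[key] = super().__new__(cls)
        return cls._instances[key]

    def __init__(self, table: str, schema: str):
        self.table = table
        self.schema = schema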
This has been addressed in the latest commit to master. I'm still doing some testing, but can you try it out and let me know please.
@toluaina we are still seeing memory leaking with the latest commit. We have a somewhat larger nested schema, so query size or query caching could be the source of the leak.
@toluaina we did not see leaking with a flat structure, and we've also seen slow leaks on an inactive db, building up over 24 hours. Not sure if that helps.
This helps a lot, thanks.
@toluaina it is possible it was still leaking with a flat structure, but certainly much, much slower.
@toluaina https://github.com/bhalonen/pgsync-mwe-leak this example absolutely leaks memory slowly over time. The spam.sh loading causes memory to ramp, but I'm not sure if it leaves uncollected memory. It isn't as high-quality a reproduction as imaginable, but I did see steady-state memory usage climb 20% over several hours.
- I spent some time trying to reproduce this.
- Memory does increase but peaks and remains steady in my case studies.
- I noticed your Dockerfile is installing pgsync from PyPI, meaning you were not using the changes on the master branch.
- In any case, I have just published 2.3.2 with the latest changes on master.
- I also did some profiling and was not able to track down any resource leak.
- I am going to try memray to see if it highlights any difference from the tools I've been using; a sketch of how it might be used is below this list.
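For reference, memray can wrap a block of code and write a capture file that can later be rendered with memray flamegraph; run_sync_iteration below is only a placeholder for whatever pgsync work is being profiled:

from memray import Tracker

def run_sync_iteration() -> None:
    # Placeholder for the pgsync workload under investigation.
    ...

# Record every allocation made inside the block into a report file.
with Tracker("pgsync_profile.bin"):
    run_sync_iteration()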
 
I have also run it with master and also had memory leaks (mounted the file system, cloned, ran make install). I think I can get a higher-quality reproduction that isn't so slow... I'll also verify whether it's fixed in this release.
You did have some more changes besides the one I tested.
Correct, I had some other minor changes which should not have too much impact.
@toluaina After testing today, we still have a leak. All my attempts to attach memray resulted in segmentation faults, with no interesting trace result in gdb. However, I was playing around with async mode and noticed it does --not-- leak memory, but it does not actually update Elastic either.
These are example logs for async mode under a write load that sync mode (a) updates search results for, and (b) leaks memory on. Async does neither... Not sure if this helps. I will try to make the example leak faster.
Async steelhead Xlog brent ver: [0] => Db: [0] => Redis: [total = 0 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [4] => Redis: [total = 4 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [4] => Redis: [total = 4 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [73] => Redis: [total = 72 pending = 1] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [542] => Redis: [total = 542 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [1,081] => Redis: [total = 1,081 pending = 0] => Elastic: [0] ...
Async steelhead Xlog brent ver: [0] => Db: [1,594] => Redis: [total = 1,586 pending = 8] => Elastic: [0] ...
I am still doing some in-depth analysis to assess the cause of any leaks.
That's a little strange. I just pulled the latest version and all changes were propagated to ES. Is Elasticsearch completely empty? Can you try deleting the state/checkpoint file and re-running? This should be the hidden file in the current directory with the name of the database and index.
Actually, I can see the Elastic document count is still at 0. That value should be increasing with each CRUD operation on the database.
Found a way to drive memory leaks: an incorrect specification in the schema file, where a "one_to_many" is improperly marked as a "one_to_one". I found this through random stress testing. This makes the leak go at warp speed, but I do not believe that is currently our problem.
I am currently wondering if "many_to_one" is a problem as well. For example, on a GitHub issue a comment belongs to one issue, an example of "one_to_many", but one commenter can belong to many issues; what is that classified as in pgsync? Is this relationship "legal"? We do many searches like that, and it could explain our leak.
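For context, a hedged sketch of how the two declarations differ in a pgsync schema file; the comment and author tables and their columns are illustrative only, not our real schema:

"children": [
  {
    "table": "comment",
    "columns": ["body"],
    "relationship": {
      "variant": "object",
      "type": "one_to_many"
    }
  },
  {
    "table": "author",
    "columns": ["name"],
    "relationship": {
      "variant": "object",
      "type": "one_to_one"
    }
  }
]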
On your questions: ES was not totally empty, as I ran a single-pass sync first to get some results in. It also didn't put any results in if run without the prior sync. Our typical start-up procedure deletes all of the checkpoint files and all Elastic data for a clean start during dev.
This is an example of our most boring possible search index:
{
    "database": "steelhead",
    "index": "steelhead_search_part_number",
    "setting": {
      "analysis": {
        "analyzer": {
          "ngram_analyzer": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "ngram_tokenizer"
          }
        },
        "tokenizer": {
          "ngram_tokenizer": {
            "token_chars": [
              "letter",
              "digit",
              "punctuation",
              "symbol"
            ],
            "type": "ngram",
            "min_gram": "2",
            "max_gram": "10"
          }
        }
      },
      "max_ngram_diff": "8"
    },
    "plugins": [
      "IdInDomain"
    ],
    "nodes": {
      "table": "part_number",
      "schema": "steelhead",
      "columns": [
        "name",
        "created_at",
        "description_markdown"
      ],
      "children": [
        {
          "table": "user",
          "schema": "steelhead",
          "columns": [
            "name",
            "id",
            "created_at"
          ],
          "relationship": {
            "variant": "object",
            "type": "one_to_one"
          },
          "children": [
            {
              "table": "domain",
              "schema": "steelhead",
              "columns": [
                "id"
              ],
              "relationship": {
                "variant": "object",
                "type": "one_to_one"
              }
            }
          ]
        }
      ]
    }
  },
I created a working branch to study this in more detail. I also have a script to insert random records. So far memory peaks at about 640 MB. I think the relationship you described would effectively be a many_to_many, and many_to_many is not supported. I'll try and play around with the relationship types to see the effect.
Apologies for the delay. This has required a bit of effort. I have identified the major cause of this. SQLAlchemy stores the SQL compilation cache on the engine (see SQL Compilation Caching). This is intended as a performance optimization for subsequent queries. I am now clearing this cache on each iteration. This has all been merged into the main branch.
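As a rough illustration of that idea (not pgsync's actual sync loop), a SQLAlchemy 1.4+ engine exposes clear_compiled_cache(), and the cache can also be disabled entirely with query_cache_size=0; the connection string and poll_once body below are placeholders:

import sqlalchemy as sa

# Passing query_cache_size=0 would disable the per-engine compiled cache entirely.
engine = sa.create_engine("postgresql://user:password@localhost/testdb")

def poll_once() -> None:
    # Placeholder for one iteration of the sync/polling loop.
    with engine.connect() as conn:
        conn.execute(sa.text("SELECT 1"))

for _ in range(10):  # stand-in for the long-running polling loop
    poll_once()
    # Drop the compiled-statement cache the engine accumulates so it
    # cannot keep growing across iterations.
    engine.clear_compiled_cache()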