
Memory problem

Open eluizbr opened this issue 3 years ago • 19 comments

PGSync version: 2.3.2

Postgres version: 12.11

Elasticsearch version: 7.10

Redis version: 6.2.6

Python version: 3.10.8

Problem Description: Friends, I have a PGSync deployment on Kubernetes, and I have been facing constant problems with memory overflow.

pgsync keeps consuming memory until it runs out. Do you have any config or guidance you can give me?

Error Message (if any):



eluizbr avatar Oct 20 '22 20:10 eluizbr

Can you share more details? How much RAM is PGSync using? What is the overall size of your DB, including row count? You can run this query to produce some stats:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 10;
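
The query above only reports table sizes; for row counts, a companion query along the same lines (a sketch that uses the planner estimates in pg_class, so the numbers are approximations maintained by ANALYZE/autovacuum, not exact counts) could be:

-- Approximate row counts per table, largest first (estimates only).
SELECT nspname || '.' || relname AS "relation",
    C.reltuples::bigint AS "approx_rows"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind = 'r'
    AND nspname !~ '^pg_toast'
  ORDER BY C.reltuples DESC
  LIMIT 10;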

toluaina avatar Oct 24 '22 21:10 toluaina

Hi, Thanks for your reply:

PGSync memory: 2 GB

public.refurbishItems    2596 MB
public.sessions          1804 MB
public.task              1662 MB
public.searches          1235 MB
public.activities         382 MB
public.refurbishes        334 MB
public.files              308 MB
public.UserPermissions    197 MB
public.items              160 MB
public.refurbishFiles     109 MB

eluizbr avatar Oct 24 '22 21:10 eluizbr

  • are you running this in a docker container?
  • 2 GB looks like a limit from some Docker configuration (can you try increasing it?)
  • what do you mean by memory overflow?
  • if you are running into the OOM killer then you do need to allocate more RAM to overcome the initial sync

toluaina avatar Nov 04 '22 21:11 toluaina

Hi @toluaina, I work with Emerson, we solved that, thanks... However, one of our replication slots occasionally stops syncing and we can't figure out why. What happens is that at some point the slot stops syncing and its size starts growing indefinitely, and it is happening with just one table. Do you have any suggestions?

thanks

mmbfreitas avatar Jan 24 '23 14:01 mmbfreitas

Hi @toluaina, just to add... it occurs 1 or 2 times a week. We need to kill the PGSync pod to get it working normally again. I don't know if it's useful information, but it runs on Kubernetes.

POSTGRES: 12.11 (AWS RDS)
PYTHON: 3.10.8 (Docker / Kubernetes)

this is our ENV today:

CONSOLE_LOGGING_HANDLER_MIN_LEVEL=WARNING
ELASTICSEARCH_SSL_SHOW_WARN=False
ELASTICSEARCH_USE_SSL=False
QUERY_CHUNK_SIZE=20000
REDIS_HOST=HOST_HERE
ELASTICSEARCH_AWS_HOSTED=true
ELASTICSEARCH_CHUNK_SIZE=6000
REDIS_READ_CHUNK_SIZE=3000
REDIS_WRITE_CHUNK_SIZE=3000
ELASTICSEARCH_STREAMING_BULK=True

LOGS:

Sync homehero Xlog: [16] => Db: [746] => Redis: [total = 746 pending = 0] => Elastic: [18,248] ...
Sync homehero Xlog: [17] => Db: [744] => Redis: [total = 744 pending = 0] => Elastic: [64,152] ...
Sync homehero Xlog: [16] => Db: [746] => Redis: [total = 746 pending = 0] => Elastic: [18,248] ...
Sync homehero Xlog: [17] => Db: [744] => Redis: [total = 744 pending = 0] => Elastic: [64,152] ...
Sync homehero Xlog: [17] => Db: [744] => Redis: [total = 744 pending = 0] => Elastic: [64,152] ...
Sync homehero Xlog: [16] => Db: [746] => Redis: [total = 746 pending = 0] => Elastic: [18,248] ...

eluizbr avatar Jan 24 '23 16:01 eluizbr

@toluaina Do you have any idea? We are hitting this problem a lot in production.

mmbfreitas avatar Jan 30 '23 15:01 mmbfreitas

Hi,

Sorry about the delay.

  • how many replication slots do you have?
  • does that mean the replication slot(s) is/are growing continuously?
  • when pgsync runs in daemon mode, it does not need or read the replication slot.
  • do you have REPLICATION_SLOT_CLEANUP_INTERVAL set? what value?
  • perhaps you should set this to a low value so you can see the replication slot drop in size (see the query below for checking slot size).
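
A quick way to watch the slot size yourself is to query pg_replication_slots for the amount of WAL each slot is retaining. A minimal sketch (run against the same database pgsync points at):

-- How much WAL each replication slot is holding back, largest first.
SELECT slot_name,
    database,
    active,
    pg_size_pretty(
      pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
  FROM pg_replication_slots
  ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;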

toluaina avatar Jan 30 '23 16:01 toluaina

also the replication slot is tied to the database and not to individual tables. do you mean pgsync stops syncing and the replication slots starts growing?

toluaina avatar Jan 30 '23 16:01 toluaina

Hi @toluaina,

how many replication slots do you have? does this mean that replication slots are continually growing?
  • Yes... growing daily...

when pgsync runs in daemon mode, it doesn't even need to read the replication slot.
  • I didn't understand...

do you have REPLICATION_SLOT_CLEANUP_INTERVAL set? what value?
  • It is not defined... it has the default value. Any suggestion?

perhaps you should set this to a low value so that you can see the replication slot size decrease.
  • What is your suggestion?

eluizbr avatar Jan 30 '23 16:01 eluizbr

So the default is 180 secs (3 mins) as described here. We just want to confirm the slots are getting truncated each time it runs. So set it to a low value like 10 secs to see if/when the replication slot gets truncated.
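
To confirm the truncation is actually happening, you can also count the changes still queued in a slot before and after the interval elapses. A minimal sketch (the slot name is a placeholder; substitute one of the slots pgsync created):

-- Count pending changes in a logical slot without consuming them.
-- Note: peeking decodes the pending changes, so this can be slow on a large backlog.
SELECT count(*) AS pending_changes
  FROM pg_logical_slot_peek_changes('your_slot_name', NULL, NULL);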

toluaina avatar Jan 30 '23 16:01 toluaina

@toluaina We only have the problem with one specific slot... today we have around 10 slots...

eluizbr avatar Jan 30 '23 16:01 eluizbr

Hey, thanks for helping,

how many replication slots do you have?

A: We have 2; pgsync is creating one for each mapped table in the schema.

does that mean the replication slot(s) is/are growing continuously?

A: yes, looking at the AWS graphs for oldest replication lag and transaction log disk usage:

[image: transaction log disk usage]

[image: oldest replication lag]

mmbfreitas avatar Jan 30 '23 16:01 mmbfreitas

also the replication slot is tied to the database and not to individual tables. do you mean pgsync stops syncing and the replication slots starts growing?

exactly what happens....

eluizbr avatar Jan 30 '23 16:01 eluizbr

So after 10 secs (assuming REPLICATION_SLOT_CLEANUP_INTERVAL is 10) you should see this line in the logs: "Truncating replication slot: "

toluaina avatar Jan 30 '23 17:01 toluaina

So after 10 secs (assuming REPLICATION_SLOT_CLEANUP_INTERVAL is 10) you should see this line in the logs: "Truncating replication slot: "

We'll try 15 seconds... If the problem persists, we'll let you know here... Thanks for your help.

eluizbr avatar Jan 30 '23 17:01 eluizbr

@toluaina First of all, it is important to say that we have a Lambda function that cleans our slots every day at 8:45 UTC. We created it because we are using Postgres 12, which does not provide the max_slot_wal_keep_size option, and sometimes the size of the slots was growing indefinitely.

So I have just noticed that the oldest replication lag always starts to grow at that time, 8:45; however, this does not happen every day, and there isn't a clear pattern to it.

Looking into the logs I have seen this:

2023-01-30 08:42:06 UTC:172.17.3.182(33648):homehero@homehero:[29390]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_installments', NULL, NULL)
2023-01-30 08:42:06 UTC:172.17.3.182(33648):homehero@homehero:[29390]:LOG: logical decoding found consistent point at C6F/140008C8
2023-01-30 08:42:06 UTC:172.17.3.182(33648):homehero@homehero:[29390]:DETAIL: There are no running transactions.
2023-01-30 08:42:06 UTC:172.17.3.182(33648):homehero@homehero:[29390]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_installments', NULL, NULL)
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:LOG: starting logical decoding for slot "homehero_payments"
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:DETAIL: Streaming transactions committing after C6F/18000110, reading WAL from C6F/18000028.
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_payments', NULL, NULL)
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:LOG: logical decoding found consistent point at C6F/18000028
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:DETAIL: There are no running transactions.
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_payments', NULL, NULL)
2023-01-30 08:45:07 UTC:172.17.3.182(43982):homehero@homehero:[11864]:ERROR: replication slot "homehero_installments" does not exist
2023-01-30 08:45:07 UTC:172.17.3.182(43982):homehero@homehero:[11864]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_installments', NULL, NULL)
2023-01-30 08:45:15 UTC:172.17.3.182(53812):homehero@homehero:[22735]:LOG: logical decoding found consistent point at C6F/1C04E8A0
2023-01-30 08:45:15 UTC:172.17.3.182(53812):homehero@homehero:[22735]:DETAIL: There are no running transactions.
2023-01-30 08:45:15 UTC:172.17.3.182(53812):homehero@homehero:[22735]:STATEMENT: SELECT * FROM PG_CREATE_LOGICAL_REPLICATION_SLOT('homehero_payments', 'test_decoding')
2023-01-30 08:45:22 UTC:172.17.3.182(53814):homehero@homehero:[22751]:LOG: logical decoding found consistent point at C6F/1C079E38
2023-01-30 08:45:22 UTC:172.17.3.182(53814):homehero@homehero:[22751]:DETAIL: There are no running transactions.
2023-01-30 08:45:22 UTC:172.17.3.182(53814):homehero@homehero:[22751]:STATEMENT: SELECT * FROM PG_CREATE_LOGICAL_REPLICATION_SLOT('homehero_installments', 'test_decoding')
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:LOG: starting logical decoding for slot "homehero_payments"
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:DETAIL: Streaming transactions committing after C6F/1C04E8D8, reading WAL from C6F/1C04E8A0.
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:STATEMENT: SELECT count(*) AS count_1 FROM PG_LOGICAL_SLOT_PEEK_CHANGES('homehero_payments', NULL, NULL) WHERE CAST(CAST(xid AS TEXT) AS BIGINT) < 162715942
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:LOG: logical decoding found consistent point at C6F/1C04E8A0
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:DETAIL: There are no running transactions.
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:STATEMENT: SELECT count(*) AS count_1 FROM PG_LOGICAL_SLOT_PEEK_CHANGES('homehero_payments', NULL, NULL) WHERE CAST(CAST(xid AS TEXT) AS BIGINT) < 162715942
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:LOG: starting logical decoding for slot "homehero_payments"
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:DETAIL: Streaming transactions committing after C6F/1C04E8D8, reading WAL from C6F/1C04E8A0.
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_PEEK_CHANGES('homehero_payments', NULL, NULL) WHERE CAST(CAST(xid AS TEXT) AS BIGINT) < 162715942 LIMIT 5000 OFFSET 0
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:LOG: logical decoding found consistent point at C6F/1C04E8A0
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:DETAIL: There are no running transactions.
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_PEEK_CHANGES('homehero_payments', NULL, NULL) WHERE CAST(CAST(xid AS TEXT) AS BIGINT) < 162715942 LIMIT 5000 OFFSET 0

At 8:45 the error "replication slot "homehero_installments" does not exist" happened; after that pgsync created the replication slots again, payments went back to working, but installments didn't.
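
That is consistent with the slot being dropped by the daily cleanup: dropping a slot discards it entirely, so the next PG_LOGICAL_SLOT_GET_CHANGES call fails with "replication slot ... does not exist" until something recreates it. As a rough illustration only (an assumption about what such a cleanup job might run, not the actual Lambda code):

-- Illustrative sketch: drop inactive slots retaining more than ~1 GB of WAL.
-- Dropping a slot discards its pending changes, which is why consumers
-- then see "replication slot ... does not exist".
SELECT pg_drop_replication_slot(slot_name)
  FROM pg_replication_slots
  WHERE NOT active
    AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1024 * 1024 * 1024;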

mmbfreitas avatar Jan 30 '23 18:01 mmbfreitas

  • It seems you have a number of replication slots.
  • I'm trying to understand which one is the culprit.
  • Replication slot homehero_installments should already exist before running pgsync.
  • The fact that it does not exist implies that the bootstrap was not run yet.
  • bootstrap -c schema.json should be run just once to set things up including ALL your replication slots.
  • then run pgsync -c schema.json to sync
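
One way to confirm that bootstrap has already created all the expected slots before starting pgsync is a quick query against pg_replication_slots (a sketch; the actual slot names depend on your schema.json):

-- List logical replication slots and whether a consumer is currently attached.
SELECT slot_name, plugin, active
  FROM pg_replication_slots
  WHERE slot_type = 'logical';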

PS: I'm happy to move this discussion to discord if you feel necessary.

toluaina avatar Jan 30 '23 19:01 toluaina

  • It seems you have a number of replication slots.
  • I'm trying to understand which one is the culprit.
  • Replication slot homehero_installments should already exist before running pgsync.
  • The fact that it does not exist implies that the bootstrap was not run yet.
  • bootstrap -c schema.json should be run just once to set things up including ALL your replication slots.
  • then run pgsync -c schema.json to sync

PS: I'm happy to move this discussion to discord if you feel necessary.

I run bootstrap -c schema.json every time the container is recreated... Could this be a problem then? About going to discord, that would be great...

eluizbr avatar Jan 30 '23 19:01 eluizbr

  • running bootstrap each time is not necessary.
  • Forgot to drop my discord user: taina#4505

toluaina avatar Jan 31 '23 20:01 toluaina