pgsync icon indicating copy to clipboard operation
pgsync copied to clipboard

Resuming sync is slow and utilises a large amount of disk space

Open voyc-jean opened this issue 2 years ago • 5 comments

PGSync version: 2.1.11 Postgres version: 12.7 Elasticsearch version: 7.10 Redis version: 5.0.6 Python version: 3.8.7 Problem Description:

When resuming PGsync after it was stopped for a while, a query runs that looks like the following:

SELECT *
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 162324251
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 163810458

When a replication slot isn't synced for a while on a busy database and the WAL grows, the above query can take a very long time to complete (in my case, around 20 minutes). On top of the query being slow, it also utilises a large amount of disk space (via temp storage) on the Postgres server:

image

These large dip in disk space begins when the above query executes. In the above case, it caused PGsync to crash with the following error:

sqlalchemy.exc.OperationalError: (psycopg2.errors.DiskFull) could not write to file "base/pgsql_tmp/pgsql_tmp19437.19": No space  left on device

If I change the query to only return for one xid, it still takes more than 20 minutes to return. My uneducated guess that that Postgres must read through the entirety of the WAL on disk in order to filter out the xids that the query requires.

In addition to the above, if the query does execute (after increasing the database disk space) PGSync might crash with the following exception depending on how much data is sitting in the WAL:

Traceback (most recent call last):
  File "/usr/local/bin/pgsync", line 7, in <module>
    sync.main()
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/pgsync/sync.py", line 1240, in main
    sync.pull()
  File "/usr/local/lib/python3.8/site-packages/pgsync/sync.py", line 1084, in pull
    self.logical_slot_changes(txmin=txmin, txmax=txmax)
  File "/usr/local/lib/python3.8/site-packages/pgsync/sync.py", line 333, in logical_slot_changes
    rows: list = self.logical_slot_peek_changes(
  File "/usr/local/lib/python3.8/site-packages/pgsync/base.py", line 411, in logical_slot_peek_changes
    return self._logical_slot_changes(
  File "/usr/local/lib/python3.8/site-packages/pgsync/base.py", line 372, in _logical_slot_changes
    return self.fetchall(statement)
  File "/usr/local/lib/python3.8/site-packages/pgsync/base.py", line 813, in fetchall
    rows = conn.execute(statement).fetchall()
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1289, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 325, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1845, in _execute_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2026, in _handle_dbapi_exception
    util.raise_(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 207, in raise_
    raise exception
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.DatabaseError: (psycopg2.DatabaseError) out of memory for query result

[SQL: SELECT xid, data 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES(%(PG_LOGICAL_SLOT_PEEK_CHANGES_1)s, NULL, NULL) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= %(param_1)s AND CAST(CAST(xid AS TEXT) AS BIGINT) < %(param_2)s]
[parameters: {'PG_LOGICAL_SLOT_PEEK_CHANGES_1': 'voyc_app_voyc_app_es', 'param_1': 172730985, 'param_2': 174377184}]
(Background on this error at: https://sqlalche.me/e/14/4xp6)

All 3.79GB of memory allocated to my PGSync container is exhausted as there are simply too many records returned at once.

voyc-jean avatar Mar 22 '22 15:03 voyc-jean

I have fixed this in the master branch so that we only fetch records from the logical replication slot in batched defined by PG_LOGICAL_SLOT_UPTO_NCHANGES which defaults to 5000.

toluaina avatar Mar 24 '22 09:03 toluaina

Though this fix does reduce the disk usage, though I'm not sure if the upto_nchanges parameter of PG_LOGICAL_SLOT_PEEK_CHANGES is working as you expect.

This query returns numerous results:

SELECT *
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 183066650
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 183116650

Whereas this query returns no results:

SELECT xid, lsn
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('voyc_app_voyc_app_es', NULL, 50000)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 183066650
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 183116650

I believe this causes the while loop here to evaluate to false immediately.

voyc-jean avatar Mar 29 '22 11:03 voyc-jean

i think you are right. I changed the logic on this one. Can you give it a try again please

toluaina avatar Mar 30 '22 20:03 toluaina

Hi @toluaina,

Sorry for the delay, I've tested but I'm still having issues with this.

I believe the problem might be with versions of Postgres < 14.0 - I have been testing with 12.7. In the release notes for PostgreSQL 14:

  • Allow logical decoding to be filtered by xid (Markus Wanner)

The filtering done by pgsync on the logical slot seems to read through all WAL segments in order to filter by XID. I plan on upgrading my Postgres to 14 in the near future and will test and report back once done.

voyc-jean avatar May 20 '22 12:05 voyc-jean

Cool. Do let me know the outcome thanks.

toluaina avatar May 20 '22 16:05 toluaina