
Downtime

Open · mmbfreitas opened this issue 7 months ago • 3 comments

PGSync version: 2.3.2, 3.3.0, 4.0.0 (I tested all of these versions)
Postgres version: 14.13 (RDS)
Elasticsearch version: 8.3.2
Redis version: Redis server v=6.0.16
Python version: Python 3.10.12

.env:

ELASTICSEARCH_SCHEME=http
ELASTICSEARCH_HOST=172.17.1.188
ELASTICSEARCH_PORT=9200
CONSOLE_LOGGING_HANDLER_MIN_LEVEL=WARNING
ELASTICSEARCH_SSL_SHOW_WARN=False
ELASTICSEARCH_USE_SSL=False
REDIS_HOST=localhost
REDIS_DB=1
ELASTICSEARCH_CHUNK_SIZE=500
REDIS_READ_CHUNK_SIZE=500
REDIS_WRITE_CHUNK_SIZE=500
REPLICATION_SLOT_CLEANUP_INTERVAL=180
REDIS_SOCKET_TIMEOUT=60
ELASTICSEARCH_TIMEOUT=60
QUERY_CHUNK_SIZE=500
USE_ASYNC=True

  • I have tried many variations of this .env, increasing and decreasing the chunk values, but nothing solved it.

Problem Description:

My Elasticsearch index has about 2,500,000 documents (2.7 GB). Sometimes pgsync starts writing heavily even though there are not many changes, and then either the service simply stops or the Redis queue keeps growing forever (even though LLEN reports 0).

Also, every time I need to bootstrap for testing, it takes me 30 minutes to reindex.

Something I noticed, though I don't know if it is related, is that the pgsync debug log shows a lot of scroll actions and the index ends up with many small segments.
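For reference, this is roughly how the queue depth and the replication-slot lag can be checked while this is happening. A minimal sketch, assuming the Redis host and DB from the .env above, the redis and psycopg2 Python packages, and placeholder Postgres credentials; the pgsync queue key is discovered by scanning rather than hard-coded, since its exact name depends on the schema:

import redis
import psycopg2

# Inspect the Redis queue: list keys and report the length of each list-typed key.
# REDIS_HOST / REDIS_DB mirror the .env above.
r = redis.Redis(host="localhost", db=1)
for key in r.scan_iter():
    if r.type(key) == b"list":
        print(key.decode(), "LLEN =", r.llen(key))

# Check how far behind the logical replication slot is on the RDS instance.
# Hostname and credentials are placeholders, not real values.
conn = psycopg2.connect(host="my-rds-host", dbname="dbname", user="user", password="password")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT slot_name, "
        "       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag "
        "FROM pg_replication_slots WHERE slot_type = 'logical'"
    )
    for slot_name, lag in cur.fetchall():
        print(slot_name, "lag =", lag)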

Any suggestion?

schema.json

[ { "database": "dbname", "index": "index1", "setting": { "analysis": { "normalizer": { "case_insensitive": { "filter": "lowercase" } } } }, "nodes": { "table": "table1", "columns": [ columns ], "transform": { "mapping": { "description": { "type": "text", "fields": { "keyword": { "type": "keyword", "normalizer": "case_insensitive" } } }, "price": { "type": "double" }, "percentage": { "type": "double" } } }, "children": [ { "table": "table2", "columns": [ "columns ], "transform": { "mapping": { "name": { "type": "text", "fields": { "keyword": { "type": "keyword", "normalizer": "case_insensitive" } } }, "tags": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }, "value": { "type": "double" } } }, "relationship": { "variant": "object", "type": "one_to_one" }, "children": [ { "table": "table3", "columns": [ colums ], "relationship": { "type": "one_to_many", "variant": "object" }, "transform": { "mapping": { "number": { "type": "text", "fields": { "keyword": { "type": "keyword", "normalizer": "case_insensitive" } } } } } }, { "table": "table4", "columns": [columns], "transform": { "mapping": { "name": { "type": "text", "fields": { "keyword": { "type": "keyword", "normalizer": "case_insensitive" } } } } }, "relationship": { "variant": "object", "type": "one_to_one" } }, { "table": "table5", "columns": [columns], "transform": { "mapping": { "name": { "type": "text", "fields": { "keyword": { "type": "keyword", "normalizer": "case_insensitive" } } } } }, "relationship": { "variant": "object", "type": "one_to_one" } }, { "table": "table6", "columns": [columns], "relationship": { "variant": "object", "type": "one_to_one" } }, { "table": "table7", "columns": [columns], "relationship": { "variant": "object", "type": "one_to_one" } }, { "table": "table8", "columns": [columns], "relationship": { "variant": "object", "type": "one_to_one" } } ] }, { "table": "table9", "columns": [columns], "relationship": { "variant": "object", "type": "one_to_one" } }, { "table": "table10", "columns": [columns], "relationship": { "variant": "object", "type": "one_to_one" } }, { "table": "table11", "columns": [columns], "relationship": { "variant": "object", "type": "one_to_one" } } ] } } ]

@toluaina can you help me? I have sent you a message on Discord too.

mmbfreitas · May 16 '25 12:05

First, try the defaults. You probably have a lot of records to process.

Try the following settings first, and then adjust them as needed:

  • increase the number of workers: NUM_WORKERS (the default is 2)
  • use non-async mode for now: USE_ASYNC=False

If this is still too slow, then for an initial sync with a large dataset I would suggest trying parallel_sync with a higher number of workers; multiprocess mode should give you the best performance: parallel_sync -c schema.json
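For reference, a minimal sketch of one way to apply those suggestions for a one-off initial sync, overriding the environment from Python and shelling out to parallel_sync (assumed to be on PATH); NUM_WORKERS=4 is only an illustrative starting point, not a recommendation:

import os
import subprocess

# Start from the suggested settings: more workers, async mode off.
# 4 is an arbitrary starting value; tune it for your machine.
env = dict(os.environ, NUM_WORKERS="4", USE_ASYNC="False")

# Same invocation as above, just launched with the overridden environment.
subprocess.run(["parallel_sync", "-c", "schema.json"], env=env, check=True)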

toluaina · May 25 '25 07:05

If this is still too slow, then for an initial sync with a large dataset I would suggest trying parallel_sync with a higher number of workers; multiprocess mode should give you the best performance: parallel_sync -c schema.json

It is not working. I'm receiving the following error:

TypeError: JSON_CAST() takes 1 positional argument but 2 were given
2025-10-28 14:57:23.202:ERROR:asyncio: Future exception was never retrieved
future: <Future finished exception=TypeError('JSON_CAST() takes 1 positional argument but 2 were given')>
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\tejas.parse\AppData\Local\Programs\Python\Python310\lib\concurrent\futures\process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "C:\Coding\master-electronics\data-replication\pgsync\venv310\Scripts\parallel_sync", line 365, in run_task
    sync.search_client.bulk(
  File "C:\Coding\master-electronics\data-replication\pgsync\venv310\lib\site-packages\pgsync\search_client.py", line 139, in bulk
    self._bulk(
  File "C:\Coding\master-electronics\data-replication\pgsync\venv310\lib\site-packages\pgsync\search_client.py", line 197, in _bulk
    for ok, info in self.parallel_bulk(
  File "C:\Coding\master-electronics\data-replication\pgsync\venv310\lib\site-packages\elasticsearch\helpers\actions.py", line 472, in parallel_bulk
    for result in pool.imap(
  File "C:\Users\tejas.parse\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 870, in next
    raise value
  File "C:\Users\tejas.parse\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\tejas.parse\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 144, in _helper_reraises_exception
    raise ex
  File "C:\Users\tejas.parse\AppData\Local\Programs\Python\Python310\lib\multiprocessing\pool.py", line 388, in _guarded_task_generation
    for i, x in enumerate(iterable):
  File "C:\Coding\master-electronics\data-replication\pgsync\venv310\lib\site-packages\elasticsearch\helpers\actions.py", line 155, in _chunk_actions
    for action, data in actions:
  File "C:\Coding\master-electronics\data-replication\pgsync\venv310\lib\site-packages\pgsync\sync.py", line 1392, in sync
    self.query_builder.build_queries(
  File "C:\Coding\master-electronics\data-replication\pgsync\venv310\lib\site-packages\pgsync\querybuilder.py", line 1000, in build_queries
    self._root(node, txmin=txmin, txmax=txmax, ctid=ctid)
  File "C:\Coding\master-electronics\data-replication\pgsync\venv310\lib\site-packages\pgsync\querybuilder.py", line 403, in _root
    JSON_CAST(
TypeError: JSON_CAST() takes 1 positional argument but 2 were given
"""

The error does make sense: if you look at querybuilder.py line 403, it passes 2 arguments to the JSON_CAST function, which takes only 1 parameter.
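For anyone skimming, the failure mode is the ordinary Python arity error; the snippet below is illustrative only and is not pgsync's actual JSON_CAST implementation:

# Illustrative only, not pgsync's code: a helper declared with one positional
# parameter, called with two, raises exactly this TypeError.
def JSON_CAST(value):
    return value

JSON_CAST("price", "double")  # TypeError: JSON_CAST() takes 1 positional argument but 2 were given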

tejasparse21 · Oct 28 '25 22:10

I'm facing the same JSON_CAST issue. Has there been any progress on this?

mukund-fiber · Dec 16 '25 00:12