
Very slow operation on very large tables

Open · timbarkerSE opened this issue 4 years ago · 5 comments

PGSync version: 2.0.0
Postgres version: 13
Elasticsearch version: 7.10.2
Redis version: 5.0.3
Python version: 3.7.9

Problem Description: I'm running pgsync on AWS using EC2, hosted Elasticsearch and hosted Postgres. My tables are huge: the primary table has 160M rows and one of the child tables has >2B rows.

I've set QUERY_CHUNK_SIZE to 1,000, as when it was set to 10,000 the process was locking up the server and I couldn't SSH onto it. I'm monitoring the process on the database side and can see it's on the second chunk from the SQL being run: FETCH FORWARD 1000 FROM "c_7fec50320ac8_1". The first chunk took about 4 hours, so clearly it won't finish the whole table in a viable amount of time. I suspect this is partly due to the size of the tables and therefore the joins being performed.
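For reference, the cursor fetches can be watched from the database side with a query along these lines (a minimal sketch assuming only the standard pg_stat_activity view):

```sql
-- Watch the pgsync cursor fetches as they run; the cursor name
-- (e.g. "c_7fec50320ac8_1") is generated by pgsync at runtime.
SELECT pid,
       state,
       now() - query_start AS running_for,
       query
FROM pg_stat_activity
WHERE query LIKE 'FETCH FORWARD%'
ORDER BY query_start;
```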

Is there something I can change to improve this, or am I better off trying a different tool like Logstash?

Thanks for your help

timbarkerSE avatar Jun 02 '21 16:06 timbarkerSE

Really sorry about the delay. 2 billion is a lot! QUERY_CHUNK_SIZE is the only database-related variable; I would suggest dropping it even further. Have you considered tuning any Postgres parameters? Also, is the database normalised, with the required indices?
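As a starting point for the index check, a heuristic catalog query such as the one below lists foreign keys whose referencing columns have no covering index, using only the standard pg_constraint and pg_index catalogs:

```sql
-- Heuristic sketch: list foreign-key constraints whose referencing
-- columns are not covered by any index, a common cause of slow joins
-- at this scale. Containment is checked without regard to column
-- order, so treat hits as candidates to inspect, not a definitive answer.
SELECT c.conrelid::regclass AS "table",
       c.conname            AS fk_constraint
FROM pg_constraint AS c
WHERE c.contype = 'f'
  AND NOT EXISTS (
        SELECT 1
        FROM pg_index AS i
        WHERE i.indrelid = c.conrelid
          AND string_to_array(i.indkey::text, ' ')::int2[] @> c.conkey
      );
```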

Maybe drop QUERY_CHUNK_SIZE to a very small number and then examine the resulting query. You can also send me the db schema and the pgsync schema.
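One practical detail when examining the generated query: pg_stat_activity truncates statement text, so seeing the full query may require raising track_activity_query_size. A minimal sketch (ALTER SYSTEM needs superuser rights, and this setting only takes effect after a server restart):

```sql
-- pg_stat_activity truncates statement text at track_activity_query_size
-- (default 1024 bytes); raising it exposes the full generated query.
SHOW track_activity_query_size;
ALTER SYSTEM SET track_activity_query_size = 32768;
```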

@timbarkerSE

toluaina avatar Jun 08 '21 20:06 toluaina

There is an edge case where this could be a problem: if you have a parent table that has relationships to a large number of child tables, the chunking only applies to the parent.

I would be keen to understand the structure of your database and the size of each row. Also, can you give an indication of the data distribution, e.g. the ratio of parent to child rows, and whether each parent can have a large number of children?
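A fan-out summary of that kind could be computed with something like the following; the names post, post_hashtag and post_id are illustrative, not taken from this thread:

```sql
-- Summarise how many child rows each parent carries, to gauge
-- how skewed the parent-to-child distribution is.
SELECT count(*)         AS parents_with_children,
       round(avg(n), 1) AS avg_children_per_parent,
       max(n)           AS max_children_per_parent
FROM (
  SELECT post_id, count(*) AS n
  FROM post_hashtag
  GROUP BY post_id
) AS per_parent;
```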

toluaina avatar Jun 08 '21 21:06 toluaina

Thanks for replying. The process did speed up in the end and the chunks were taking about 10 mins each, with the whole process taking about 10 days. Still slightly long for what I need, but better.

In the end I decided to write my own script to do the syncing, because I could build a really bespoke, efficient query that runs quickly. My use case has Instagram posts as the primary table and hashtags as a child table (via a huge join table). The hashtags are contained within the caption in the primary table, so it's easier to derive them from there than to handle the huge join! Lots of my other joins are many-to-many relations, but in reality there's only ever one record, so they're effectively many-to-one, and I could write fast SQL that handles that situation.
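A sketch of the caption-derived approach, using a hypothetical post table with id and caption columns rather than the author's actual schema:

```sql
-- Illustrative sketch, not the author's actual script: derive hashtags
-- from the caption text with a regular expression instead of joining
-- through the large hashtag join table.
SELECT p.id,
       array(
         SELECT DISTINCT lower(m[1])
         FROM regexp_matches(p.caption, '#(\w+)', 'g') AS m
       ) AS hashtags
FROM post AS p;
```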

Anyway, I appreciate your work on pgsync and taking the time to answer issues.

timbarkerSE avatar Jun 09 '21 08:06 timbarkerSE

Thanks for this. I am keen to learn from what you did to optimise the query path.

toluaina avatar Jun 16 '21 19:06 toluaina