[source-postgres] Not respecting sync mode Append + Deduped
Connector Name
source-postgres
Connector Version
3.2.21
What step the error happened?
During the sync
Relevant information
- Platform version: Airbyte 0.50.33
- Platform deployment: Docker container on a bare EC2 instance
- Source: airbyte/source-postgres:3.2.21
- Update method: Xmin System Column
- Destination: airbyte/destination-snowflake:3.4.10
- Table configuration: Incremental | Append + Deduped
The team has set up a daily job that reads data from Postgres and writes to Snowflake. The Postgres source database is cloned from the Production DB at midnight, so we have a full snapshot of what is happening.
As you will see in the pictures, the daily job used to take around 12 minutes, but it suddenly started taking hours. The sync appears to run a full refresh each time instead of appending and deduplicating. Since that day, the pipeline has been taking hours daily.
Interestingly, attempt 4 of the last job failed because of memory, yet it extracted around 33M records, while attempt 5 extracted 110M rows. The difference is enormous.
What could be the issue?
Relevant log output
No response
Contribute
- [X] Yes, I want to contribute
@marcraminv can you share the complete log files for the short and long syncs? They can help us identify what is causing the issue.
Here you have it @marcosmarxm! Log file for the short run on 14.03.2024: short-sync. Log file for the long run (attempt 5) on 15.03.2024: long-sync.
Since March 15th, all syncs have been reading the full dataset. Let me know if I can contribute to this or follow up with you.
From the logs I can see that you are using xmin, and that it has undergone a wraparound, which results in a full sync each time. This tends to happen when the database processes a very large number of transactions.
See https://docs.airbyte.com/integrations/sources/postgres#xmin
I would recommend switching to a different sync mode: CDC is preferred; otherwise, cursor-based.
Hey @akashkulk, but the stream size is not large. As you can see in the images, the total amount of data is less than 100 GB.
Hi @marcraminv, that is true, but a wraparound can occur even with a small amount of data if the database processes a large number of transactions. That is what is happening in your case:
Xmin Status : {Number of wraparounds: 2, Xmin Transaction Value: 8711059, Xmin Raw Value: 8598645651}
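For illustration, the numbers in that status line are consistent with how a 64-bit "raw" xmin value packs a wraparound count (epoch) into the high bits and the 32-bit transaction ID into the low bits. A minimal sketch (the function name is hypothetical, not Airbyte code):

```python
# Postgres transaction IDs are 32-bit, so a 64-bit "raw" value combines an
# epoch (how many times the 32-bit counter has wrapped) in the high 32 bits
# with the current 32-bit transaction ID in the low 32 bits.
XID_BITS = 32

def decompose_raw_xmin(raw: int) -> tuple[int, int]:
    """Split a 64-bit raw xmin into (wraparound count, 32-bit xmin)."""
    wraparounds = raw >> XID_BITS
    xmin = raw & ((1 << XID_BITS) - 1)
    return wraparounds, xmin

# Values from the log line above:
wraps, xmin = decompose_raw_xmin(8598645651)
print(wraps, xmin)  # 2 8711059
```

Note that 2 * 2^32 + 8711059 = 8598645651, matching all three numbers reported in the log.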
So in your case I'd recommend moving to cursor-based or CDC if you want to avoid these resyncs.