airbyte icon indicating copy to clipboard operation
airbyte copied to clipboard

[source-postgres] Not respecting sync mode Append + Dedupl

Open marcraminv opened this issue 10 months ago • 4 comments

Connector Name

source-postgres

Connector Version

3.2.21

What step the error happened?

During the sync

Relevant information

platform version: airbyte 0.50.33 platform deployment: Docker container on bare EC2 instance source: airbyte/source-postgres:3.2.21 | Update postgres method: Xmin System Column destination: airbyte/destination-snowflake:3.4.10

table configuration: Incremental | Append + dedupl

The team has set up a daily job that reads data from Postgres and writes to Snowflake. The Postgres source database is cloned from the Production DB at midnight, so we have a full snapshot of what is happening.

As you will see in the pictures, the daily job took around ~12 minutes, but it started taking hours. The sync looks to run a full refresh each time instead of appending and deduplicating. Since that day, the pipeline has been taking hours daily.

It's interesting because attempt 4 on the last job failed because of memory, but the extracted number of records is around 33M, while attempt 5 extracted 110M rows. The difference is enormous.

What could be the issue?

SCR-20240408-stbm SCR-20240408-tnfq SCR-20240408-tnhi

Relevant log output

No response

Contribute

  • [X] Yes, I want to contribute

marcraminv avatar Apr 08 '24 21:04 marcraminv

@marcraminv can you share the complete log file for the short and long syncs? It can help us identifying what is causing the issue.

marcosmarxm avatar Apr 10 '24 11:04 marcosmarxm

Here you have it @marcosmarxm ! Log file day for short run on 14.03.2024: short-sync Log file day for the long run (attempt 5) on 15.03.2024: long-sync

After the day 15th of Mar. All syncs are reading the full dataset. Let me know if I can contribute to this, or following up with you

marcraminv avatar Apr 10 '24 12:04 marcraminv

From the logs - I see that you are using xmin. In addition, it has undergone wraparound resulting in a full sync each time. This tends to happen when the database gets quite large (exceeds a larger number of transactions).

See https://docs.airbyte.com/integrations/sources/postgres#xmin

I would recommend switching to a different sync mode - CDC is preferred if not cursor based.

akashkulk avatar Apr 30 '24 21:04 akashkulk

Hey @akashkulk, But the stream size is not large, as you can see on the images the total amount of data is less than 100Gb.

marcraminv avatar May 02 '24 08:05 marcraminv

Hi @marcraminv, that is true - but in some cases even with a small amount of data a wraparound can occur. This can happen if there are a large number of transactions in your database. These are referred to as wraparounds. In this case - that is what is happening

Xmin Status : {Number of wraparounds: 2, Xmin Transaction Value: 8711059, Xmin Raw Value: 8598645651

So in your case I'd recommend moving to cursor based or CDC if you want to avoid these resyncs

akashkulk avatar May 08 '24 18:05 akashkulk