[source-postgres] Missing replication slot caused by Azure PG Flex Server failover event leads to data loss
Connector Name
source-postgres
Connector Version
3.*
What step the error happened?
Other
Relevant information
We're facing issues/limitations when consuming data from Azure Postgres v13 via CDC with the source-postgres connector.
Source information: Azure PostgreSQL 13.10 https://learn.microsoft.com/en-us/azure/postgresql/
The problem occurs when Azure Postgres occasionally triggers a failover event, which causes a new database replica to spin up and the old replica to be brought down.
For more information on the HA failover events: https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-high-availability
When this occurs, our Airbyte connector fails with the following Airbyte error: """ Log4j2Appender says: ERROR i.a.i.b.AirbyteExceptionHandler(uncaughtException):26 Something went wrong in the connector. See the logs for more details. java.lang.RuntimeException: Unable establish a connection: Could not connect with provided configuration. Error: Expected exactly one replication slot but found 0. Please read the docs and add a replication slot to your database. """
The reason for the above error is a known limitation of the Postgres offering we consume from (https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/concepts-logical#limitations): the replication slot is destroyed along with the previous DB replica, and no replication slot is initialised on the new replica, so the slot Airbyte expects does not exist. See https://docs.airbyte.com/integrations/sources/postgres/#configuring-postgres-connector-with-change-data-capture-cdc
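For reference, a quick way to confirm whether the slot survived the failover is to query pg_replication_slots on the new primary (the slot name below is just our naming convention; substitute your own):
-- Check whether the replication slot Airbyte relies on still exists (example slot name)
select slot_name, plugin, slot_type, confirmed_flush_lsn
from pg_replication_slots
where slot_name = 'airbyte_somedb_pgoutput';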
Our usual process:
- We need to manually recreate the replication slot on the postgres_db. SQL:
select * from pg_create_logical_replication_slot('airbyte_somedb_pgoutput', 'pgoutput');
- Resync the job. When we resync the job, the LSN offset that we last successfully consumed before the recovery points to a moment in the past, before we re-initialised the replication slot. source-postgres then ignores the saved offset and triggers a full re-sync:
Saved offset is before Replication slot's confirmed_flush_lsn, Airbyte will trigger sync from scratch
This is an issue for us because we lose some CDC data between the point of failover and the point of recovery. Yes, Airbyte performs a full refresh of the Postgres source and ingests all of the data that exists at that moment in time (at the time of recovery), but what about the DELETEs that occurred during the replication slot downtime?
Notice that there is a gap between the saved offset LSN 1208546759936 and the new target LSN 1210107702704 (see the log output below); we lose CDC data within this window.
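As a side note, the decimal LSNs in the Airbyte/Debezium logs can be converted into Postgres' native X/Y format to quantify the gap. A minimal sketch, assuming the logged numbers are the raw 64-bit LSN values:
-- Convert the logged decimal LSNs to pg_lsn and measure the gap in bytes
with lsns as (
  select
    (to_hex(1208546759936 >> 32) || '/' || to_hex(1208546759936 & 4294967295))::pg_lsn as saved_offset_lsn,
    (to_hex(1210107702704 >> 32) || '/' || to_hex(1210107702704 & 4294967295))::pg_lsn as target_lsn
)
select saved_offset_lsn, target_lsn, pg_wal_lsn_diff(target_lsn, saved_offset_lsn) as gap_bytes
from lsns;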
Conclusion
When this occurs:
Saved offset is before Replication slot's confirmed_flush_lsn, Airbyte will trigger sync from scratch
We can incur CDC data loss.
What we want
We would like to re-sync the connector to recover at the exact LSN we last successfully consumed.
Possible solutions
- We recover the last LSN we have consumed from the Airbyte state db:
-- Get state of connection
select state from state where connection_id = '<connection_id>';
Within the nested object in the state field, the latest LSN we consumed from the WAL is stored in the lsn_proc and lsn_commit fields:
{"state": {"{\"schema\":null,\"payload\":[\"db_name\",{\"server\":\"db_name\"}]}": "...,\"lsn_proc\":1209989616504,\"lsn_commit\":1209989616504,\"lsn\":1209989616504,..."}}
We would like to take this value and re-initialise our sync at the last LSN we have consumed.
This could potentially be achieved with something along the lines of pg_replication_origin_advance to set our replication position to a given LSN.
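For illustration only, this is roughly what we had in mind; the origin name is hypothetical, the origin would first need to be registered with pg_replication_origin_create, and (as discussed further down) this turned out not to work for moving backwards:
-- Hypothetical sketch: register an origin and advance it to the last consumed LSN
-- ('airbyte_origin' is a placeholder; 1209989616504 is the lsn_commit value from the state above)
select pg_replication_origin_create('airbyte_origin');
select pg_replication_origin_advance(
  'airbyte_origin',
  (to_hex(1209989616504 >> 32) || '/' || to_hex(1209989616504 & 4294967295))::pg_lsn
);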
Interested to hear what folks think about this.
Relevant log output
2023-01-01 00:00:00 source > INFO i.a.i.d.i.PostgresDebeziumStateUtil(isSavedOffsetAfterReplicationSlotLSN):58 Replication slot confirmed_flush_lsn : 1209989616504 Saved offset LSN : 1208546759936
2023-01-01 00:00:02 source > WARN i.a.i.s.p.PostgresSource(getIncrementalIterators):381 Saved offset is before Replication slot's confirmed_flush_lsn, Airbyte will trigger sync from scratch
2023-01-01 00:00:03 source > INFO i.a.i.s.p.PostgresCdcTargetPosition(targetPosition):50 identified target lsn: PgLsn{lsn=1210107702704}
Contribute
- [X] Yes, I want to contribute
I'm not sure I understand the requirement. Based on the description, it seems like the replication slot is dropped, we manually create a new replication slot and use that. But the LSN is lost. You can't recover it because Postgres would delete the WAL associated with that LSN. We won't be able to read the WAL from the old LSN using the new replication slot.
If they can somehow configure the new replication slot to point to the old LSN, then they can extract the LSN value from the last sync from the logs and point the slot at that LSN; Airbyte cannot do this.
Although I doubt this is possible: you can move the LSN forward for a replication slot, not backwards.
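(For reference, the forward-only movement referred to here is what pg_replication_slot_advance, available since PostgreSQL 11, provides; a sketch using the example slot name from earlier, which only accepts positions at or ahead of the slot's current one:)
-- Move the slot's confirmed position forward, e.g. up to the current WAL write position
select pg_replication_slot_advance('airbyte_somedb_pgoutput', pg_current_wal_lsn());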
@subodh1810 would pg_replication_origin_advance have helped to advance the lsn on the failover instance so that we could resume the sync from that instant? (this function was mentioned in @seanglynn-thrive's writeup)
@prateekmukhedkar Yes, I saw that, but moving the LSN backwards doesn't seem possible. They are thinking that the new replication slot could have the confirmed_flush_lsn of the old replication slot, but I doubt it's possible because I have never heard of a scenario where the LSN can go backwards for a replication slot.
@subodh1810 @prateekmukhedkar
Thank you for confirming this. After some analysis and testing, we've validated that it is not possible to use the pg_replication_origin_advance() postgres function to point at an LSN in the past, as you have mentioned.
We're currently working on finding the best solution to ensure we incur zero data loss.
Our plan so far:
- (Short term): Develop a prerequisite script that will check that the replication slot exists within the db and recreate it if it does not exist.
- (Long term): Work with Azure support to find the best option to avoid CDC downtime/data loss.
Regarding the pre-sync script to check/recreate the replication slot on the db: in your opinion, is this a worthwhile approach to handling missing replication slots during a failover event? Do you feel something like this could reside within the Airbyte source-postgres connector, or as a pre-sync step in an orchestration workflow (Airbyte, Dagster etc.)?
I'm surprised nobody else has encountered this issue while using source-postgres (or in the wider Debezium community), as it does seem to be a common Postgres offering on Azure 🤔
Please let me know your thoughts on this. Thank you for your assistance and feedback up until now!
@seanglynn-thrive In theory, at the beginning of the sync Airbyte could check whether the replication slot exists and, if not, create it by itself, but creating a replication slot requires additional privileges/permissions. With the current set of permissions that we ask from the user, it's not possible for us to create it on our own. In my opinion Airbyte should not have these extra privileges, and the user should always be responsible for creating the replication slot.
@subodh1810 I agree this should not be the responsibility of the Airbyte source connector.
We have started developing a "PG slot check and create if not exists" function which we invoke as a prerequisite step for each of our CDC ingress pipelines, managed by our orchestrator (Dagster). The workflow is as follows:
- Check replication slot for DB X
- If slot exists => continue
- If slot does not exist => fire Slack alert; recreate replication slot; continue
- Run Airbyte CDC sync job
- Build downstream data assets
A minimal SQL sketch of the check/create step follows. I can elaborate more on this once we have a hardened solution. Let me know your thoughts on this.
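For illustration, assuming the same example slot name as earlier (our real implementation runs from the orchestrator and also fires the Slack alert):
-- Create the logical replication slot only if it does not already exist
do $$
begin
  if not exists (
    select 1 from pg_replication_slots
    where slot_name = 'airbyte_somedb_pgoutput'
  ) then
    perform pg_create_logical_replication_slot('airbyte_somedb_pgoutput', 'pgoutput');
  end if;
end
$$;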
@seanglynn-thrive The above makes sense! Just wanted to remind you that every time the replication slot is dropped and re-created, Airbyte will trigger a sync from scratch.