franz-go

Fetch returns duplicate records after Kafka (Redpanda) disconnect

Open genzgd opened this issue 2 months ago • 0 comments

We're testing the behavior of PollRecords during a temporary Kafka outage and we're trying to avoid getting duplicate records.

In our integration test, we read 10k records from Redpanda, call PollRecords again, and commit in a separate goroutine. While that second PollRecords is running, we pause the Redpanda container for 45 seconds. We then produce another 10k records to the same topic. The follow-up PollRecords (which has been running for 45 seconds) successfully picks up those 10k records, but we don't have time to commit them before we get a heartbeat error and the client tries to rejoin the group:

2024-04-27 05:51:01.591 INF heartbeat errored err="UNKNOWN_MEMBER_ID: The coordinator is not aware of this member." group=TestKafkaRetries
2024-04-27 05:51:01.591 DBG entering OnPartitionsLost format=json org_id=test-org pipe_id=TestKafkaRetries service_id= type=kafka with={"trips_json":[0]}
2024-04-27 05:51:01.591 INF injecting fake fetch with an error err="unable to join group session: UNKNOWN_MEMBER_ID: The coordinator is not aware of this member." why="notification of group management loop error"
2024-04-27 05:51:01.591 INF assigning partitions format=json how=1 input=null why="clearing assignment at end of group management session"
2024-04-27 05:51:01.591 ERR join and sync loop errored backoff=213.24197 consecutive_errors=1 err="UNKNOWN_MEMBER_ID: The coordinator is not aware of this member." 

When the new group session begins, it picks up the commit from the first 10k records, but then rereads the second 10k and we end up processing them again.
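To make the sequence concrete, here is a minimal simulation in plain Go. It is not franz-go code: `partitionLog`, `session`, and `simulate` are hypothetical stand-ins, and the sketch assumes a single partition and that a new group session resumes from the last committed offset.

```go
package main

import "fmt"

// partitionLog is a hypothetical stand-in for one topic partition plus
// the committed offset the group coordinator stores for it.
type partitionLog struct {
	records   []string
	committed int // next offset the group should read after a (re)join
}

// session is a hypothetical stand-in for one group session.
// Its fetch position lives only in memory and is lost when the session ends.
type session struct {
	pl   *partitionLog
	next int
}

// join starts a new session at the last committed offset.
func join(pl *partitionLog) *session { return &session{pl: pl, next: pl.committed} }

// poll returns every record from the current position and advances it.
func (s *session) poll() []string {
	out := s.pl.records[s.next:]
	s.next = len(s.pl.records)
	return out
}

// commit stores the session's current position as the committed offset.
func (s *session) commit() { s.pl.committed = s.next }

// produce appends n records named after their offsets.
func produce(pl *partitionLog, n int) {
	for i := 0; i < n; i++ {
		pl.records = append(pl.records, fmt.Sprintf("r%d", len(pl.records)))
	}
}

// simulate replays the test: poll and commit a first batch, poll a second
// batch that is never committed, then rejoin and poll again.
func simulate(n int) (first, second, replay []string) {
	pl := &partitionLog{}
	produce(pl, n)
	s1 := join(pl)
	first = s1.poll()
	s1.commit()

	produce(pl, n)
	second = s1.poll() // delivered, but the session is lost before commit

	s2 := join(pl) // new session starts from the committed offset
	replay = s2.poll()
	return first, second, replay
}

func main() {
	first, second, replay := simulate(3)
	fmt.Println(len(first), len(second), len(replay))
	fmt.Println("replay repeats second batch:", replay[0] == second[0])
}
```

In this model the second batch is handed to the application twice because the only durable position is the committed one.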

Without tracking offsets ourselves for "exactly once" semantics, is there some way to avoid this scenario? It's challenging that the PollRecords group "survives" the 45-second outage, fetches records, and everything seems okay, but then the heartbeat comes back and says "this is broken, I have to start over".
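For reference, the kind of manual offset tracking we would like to avoid could look roughly like the sketch below. It is plain Go with no franz-go calls; `offsetTracker` and `markIfNew` are hypothetical names. It assumes records within a partition are processed in offset order and that the tracker lives in the same process across the rejoin (it would not survive a restart).

```go
package main

import (
	"fmt"
	"sync"
)

// topicPartition identifies one partition of one topic.
type topicPartition struct {
	topic     string
	partition int32
}

// offsetTracker remembers, per partition, the next offset we expect to process.
type offsetTracker struct {
	mu   sync.Mutex
	next map[topicPartition]int64
}

func newOffsetTracker() *offsetTracker {
	return &offsetTracker{next: make(map[topicPartition]int64)}
}

// markIfNew reports whether the record at the given offset has not been
// processed yet, and if so records it as processed.
func (t *offsetTracker) markIfNew(topic string, partition int32, offset int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	key := topicPartition{topic: topic, partition: partition}
	if offset < t.next[key] {
		return false // already processed in an earlier session
	}
	t.next[key] = offset + 1
	return true
}

func main() {
	tr := newOffsetTracker()
	processed := 0
	// Offsets 0..4 arrive first; after a rejoin, 2..6 are delivered.
	for _, off := range []int64{0, 1, 2, 3, 4, 2, 3, 4, 5, 6} {
		if tr.markIfNew("trips_json", 0, off) {
			processed++
		}
	}
	fmt.Println("processed:", processed) // processed: 7
}
```

Each record delivered by a poll would be passed through `markIfNew` before processing, so a replayed batch is skipped rather than handled twice.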

genzgd, Apr 27 '24 12:04