kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-14467:Add a test to validate the replica state after processing the OFFSET_MOVED_TO_TIERED_STORAGE error

Open hudeqi opened this issue 1 year ago • 5 comments

  1. As jira described, add a test to validate the replica state after processing the OFFSET_MOVED_TO_TIERED_STORAGE error, especially for the transactional state.
  2. In producerStateManager, first call truncateFullyAndStartAt to clean up the snapshot files, and then pull the snapshot file from RemoteLogManager. If the order of calls is reversed in original logic, it may cause the newly built snapshot file to be cleaned up again by truncateFullyAndStartAt.

hudeqi avatar Oct 27 '23 12:10 hudeqi

Heya @hudeqi, thank you for the contribution! I have been reviewing this code change and I am a bit uncertain to its purpose so I wanted to ask some follow up questions. As far as I understand the current code flow roughly does the following:

  1. Download a snapshot file from remote storage
  2. Reset data structures within ProducerStateManager and delete snapshots present in the snapshots data structure via the truncateFullyAndStartAt
  3. Read all snapshot files on disk and repopulate the data structures inside the ProducerStateManager

However, since downloading the snapshot from remote storage does not update the snapshots data structure I do not see how the new file will be deleted as part of the call to truncateFullyAndStartAt.

I also found the JIRA description a bit confusing because it kept on linking to comments people made, but none of them detailed how this could be happening.

Could you elaborate how the call to truncateFullyAndStartAt will delete the newly downloaded file? Alternatively have I misunderstood what you mean to do with this pull request?

clolov avatar Oct 31 '23 10:10 clolov

Heya @hudeqi, thank you for the contribution! I have been reviewing this code change and I am a bit uncertain to its purpose so I wanted to ask some follow up questions. As far as I understand the current code flow roughly does the following:

  1. Download a snapshot file from remote storage
  2. Reset data structures within ProducerStateManager and delete snapshots present in the snapshots data structure via the truncateFullyAndStartAt
  3. Read all snapshot files on disk and repopulate the data structures inside the ProducerStateManager

However, since downloading the snapshot from remote storage does not update the snapshots data structure I do not see how the new file will be deleted as part of the call to truncateFullyAndStartAt.

I also found the JIRA description a bit confusing because it kept on linking to comments people made, but none of them detailed how this could be happening.

Could you elaborate how the call to truncateFullyAndStartAt will delete the newly downloaded file? Alternatively have I misunderstood what you mean to do with this pull request?

Sorry for not stating it clearly in this jira. This jira was originally for adding a unit test to validate the transactional state after processing the OFFSET_MOVED_TO_TIERED_STORAGE error. When I was adding unit tests for this logic, I discovered "pulling snapshots from the remote and the file may then be cleaned" issue failed in testing, this issue has not yet been reflected in jira.

As for how this issue occurs in the original logic: snapshots is a map, and the key is a long type offset. If the name (offset value) of the snapshot file first pulled from the remote storage and constructed happens to be in the keyset of local snapshots, there may be problems with being cleaned up later. @clolov

I don't know if my understanding and processing are correct. @satishd please help to confirm. Thanks.

hudeqi avatar Nov 01 '23 11:11 hudeqi

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Feb 19 '24 03:02 github-actions[bot]

@hudeqi , are you still interested in completing this PR? There seems to have some unaddressed comments..=

showuon avatar Feb 19 '24 06:02 showuon

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Jun 26 '24 03:06 github-actions[bot]