kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-14073 Logging the reason for Snapshot

Open ashmeet13 opened this issue 2 years ago • 12 comments

When a snapshot is taken it is due to either of the following reasons -

  1. Max bytes were applied
  2. Metadata version was changed

Once the snapshot process is started, it will log the reason that initiated the process.

Updated existing tests to include code changes required to log the reason. I was not able to check the logs when running tests - could someone guide me on how to enable logs when running a specific test case.

Example logs after the changes -

[2022-07-25 14:34:39,769] INFO [Controller 3000] Generating a snapshot that includes (epoch=1, offset=0) after 91 committed bytes since the last snapshot, because max bytes exceeded. (org.apache.kafka.controller.QuorumController:1328)
[2022-07-25 14:40:51,783] INFO [BrokerMetadataSnapshotter id=2] Creating a new snapshot at offset 5 because metadata version changed... (kafka.server.metadata.BrokerMetadataSnapshotter:66)

Committer Checklist (excluded from commit message)

  • [ ] Verify design and implementation
  • [ ] Verify test coverage and CI build status
  • [ ] Verify documentation (including upgrade notes)

ashmeet13 avatar Jul 17 '22 13:07 ashmeet13

@dengziming since you raised the JIRA ticket for KAFKA-14073, would please review this PR.

ashmeet13 avatar Jul 18 '22 06:07 ashmeet13

@dengziming I have made the changes requested. I have also made the changes in QuorumController.maybeGenerateSnapshot to log the reason for the snapshot being generated.

maybeGenerateSnapshot seemed to be a better place than QuorumController.SnapshotGeneratorManager.createSnapshotGenerator since it already had logs for starting a snapshot.

Also, it seems like QuorumController only creates a snapshot when max bytes are exceeded, I couldn't find it calling the snapshot generator for the reason metadata version changed.

ashmeet13 avatar Jul 19 '22 18:07 ashmeet13

Hi @dengziming, could you review this once more? I have made the requested changes.

ashmeet13 avatar Jul 22 '22 06:07 ashmeet13

@dengziming I have made the requested changes and added the sample logs in the description of the PR.

ashmeet13 avatar Jul 25 '22 09:07 ashmeet13

Got it, working on these changes. Taking a bit of time to go through the code.

ashmeet13 avatar Aug 03 '22 14:08 ashmeet13

Hi @jsancio, I have defined an enum in the raft module for SnapshotReason and have used it in place of string messages that were being passed previously.

I am working on making the changes for RaftClient and KafkaMetadataLog. These changes will log the snapshot reason in createNewSnapshot function of KafkaMetadataLog.

I had two implementation queries, could you please help with them -

  1. How do we handle multiple reasons for starting a snapshot in an enum?
  2. I would also need to change the function signature of createNewSnapshot() in KafkaMetadataLog to accommodate the new SnapshotReason parameter. Below is the new signature. Would this be okay?
Optional<RawSnapshotWriter> createNewSnapshot(OffsetAndEpoch snapshotId, SnapshotReason reason);

ashmeet13 avatar Aug 03 '22 19:08 ashmeet13

Thanks for the changes @ashmeet13 they look good in general.

1. How do we handle multiple reasons for starting a snapshot in an enum?

Instead of using Optional[SnapshotReason] we could use Set[SnapshotReason]. Depending on the layer we can associate different means to this set. When returned from def shouldSnapshot() the caller can assume that if the set is empty it means that it should not generate a snapshot.

2. I would also need to change the function signature of `createNewSnapshot()` in `KafkaMetadataLog` to accommodate the new `SnapshotReason` parameter. Below is the new signature. Would this be okay?
Optional<RawSnapshotWriter> createNewSnapshot(OffsetAndEpoch snapshotId, SnapshotReason reason);

How about:

Optional<RawSnapshotWriter> createNewSnapshot(OffsetAndEpoch snapshotId, Set<SnapshotReason> reasons);

If the set reasons is empty then the implementation of createNewSnapshot can assume that the reason is unknown.

What do you think?

jsancio avatar Aug 04 '22 16:08 jsancio

Thanks @jsancio, this is helpful. I'll go ahead and make these changes!

ashmeet13 avatar Aug 04 '22 16:08 ashmeet13

Thanks @jsancio, this is helpful. I'll go ahead and make these changes!

@ashmeet13 ,

I thought about this some more and I don't think we should include the reason for the snapshot in the raft module and types. In other words, I think we should:

  1. Move the SnapshotReason type from raft/src/main/java/org/apache/kafka/raft to metadata/src/main/java/org/apache/kafka/metadata/util.
  2. Keep the old signature for the createNewSnapshot. I suspect that we can undo all of the changes you made to the raft module.

I think this now because the reasons for creating a snapshot are specific to metadata and not raft. We have an interest in using the raft module and the KRaft protocol to solve other problems. I don't think we will have the same reasons for wanting to generate snapshot in those use cases.

We should still use the SnapshotReason type and I am just concern on where it should live. Thanks and excuse my earlier misguided advice.

jsancio avatar Aug 05 '22 12:08 jsancio

Hi @jsancio, just one another doubt - Does this mean we don't want to log the reason within KafkaMetadataLog and keep the logging as it is right now within BrokerMetadataSnapshotter and QuorumController?

ashmeet13 avatar Aug 05 '22 18:08 ashmeet13

Hi @jsancio, just one another doubt - Does this mean we don't want to log the reason within KafkaMetadataLog and keep the logging as it is right now within BrokerMetadataSnapshotter and QuorumController?

Yes. That's correct. Let's keep the logging we have now and extend the message to include the reason(s).

jsancio avatar Aug 05 '22 21:08 jsancio

Hi @jsancio, I have reverted the changes made to RaftClient and kept the logging withing BrokerMetadataSnapshotter and QuorumController

I have two questions, would really be helpful if you could guide me on this -

  1. We also take a snapshot in org.apache.kafka.raft within the class ReplicatedCounter under the handleCommit function. I am facing an issue in being able to import the SnapshotReason from the metadata/src/main/java/org/apache/kafka/metadata/util package. Could you guide me on how I could fix this?

  2. We take snapshots at two different places within QuorumController - One I was able to figure out the reason being MaxBytesExceeded. There is another function beginWritingSnapshot which also initiates a snapshot - I was not able to figure out who calls this function and what would the reason be in this scenario?

ashmeet13 avatar Aug 06 '22 17:08 ashmeet13

Hi @dengziming @jsancio, can you help with a review on this PR. If it looks okay to merge I can clean up the PR and remove temporary comments added by me.

ashmeet13 avatar Aug 16 '22 05:08 ashmeet13

Thank you @dengziming for the review. Made the changes to remove the comment I had added and fixed the typo in the log message.

ashmeet13 avatar Aug 16 '22 16:08 ashmeet13

Thanks @dengziming for the review. @jsancio could you please have a look at this PR?

ashmeet13 avatar Aug 23 '22 10:08 ashmeet13

Hi @jsancio, bumping this PR up in case it got missed. Could please have a look at this?

ashmeet13 avatar Aug 27 '22 15:08 ashmeet13

Hi @jsancio, I have reverted the changes made to RaftClient and kept the logging withing BrokerMetadataSnapshotter and QuorumController

I have two questions, would really be helpful if you could guide me on this -

1. We also take a snapshot in `org.apache.kafka.raft` within the class `ReplicatedCounter` under the `handleCommit` function.
   I am facing an issue in being able to import the `SnapshotReason` from the `metadata/src/main/java/org/apache/kafka/metadata/util` package. Could you guide me on how I could fix this?

ReplicatedCounter is mainly used to test the KRaft implementation. We should probably consider moving this type to raft/src/test.

2. We take snapshots at two different places within `QuorumController` - One I was able to figure out the reason being `MaxBytesExceeded`. There is another function `beginWritingSnapshot` which also initiates a snapshot - I was not able to figure out who calls this function and what would the reason be in this scenario?

It looks like this method is only used by tests. I think it is okay for this to have an "unknown reason" for now.

jsancio avatar Aug 31 '22 18:08 jsancio

Thank you @jsancio for the review. Have made the changes.

For the following comment -

It looks like this method is only used by tests. I think it is okay for this to have an "unknown reason" for now.

Currently the function in discussion has no logging taking place hence I haven't added the UnknownReason in. Should I update the function to log the same?

Edit - I have gone ahead and added the log the same way we are logging in the actual function within QuorumController. I can remove it if necessary.

ashmeet13 avatar Sep 01 '22 17:09 ashmeet13

Hi @jsancio could you please review this PR? Sorry for the multiple changes this has required.

ashmeet13 avatar Sep 07 '22 03:09 ashmeet13

Hi @jsancio could you please review this PR? Sorry for the multiple changes this has required.

Thanks for the changes and excuse the delays. LGTM in general. Restarted the build as it seems the JVM 17 and Scala 2.13 configuration failed to build.

jsancio avatar Sep 07 '22 16:09 jsancio

Thank you @jsancio, I see that the build for JVM 17 and JDK 2.13 has timed out again. What can be done from my end to fix this?

ashmeet13 avatar Sep 08 '22 13:09 ashmeet13