kafka
kafka copied to clipboard
KAFKA-14073 Logging the reason for Snapshot
When a snapshot is taken it is due to either of the following reasons -
- Max bytes were applied
- Metadata version was changed
Once the snapshot process is started, it will log the reason that initiated the process.
Updated existing tests to include code changes required to log the reason. I was not able to check the logs when running tests - could someone guide me on how to enable logs when running a specific test case.
Example logs after the changes -
[2022-07-25 14:34:39,769] INFO [Controller 3000] Generating a snapshot that includes (epoch=1, offset=0) after 91 committed bytes since the last snapshot, because max bytes exceeded. (org.apache.kafka.controller.QuorumController:1328)
[2022-07-25 14:40:51,783] INFO [BrokerMetadataSnapshotter id=2] Creating a new snapshot at offset 5 because metadata version changed... (kafka.server.metadata.BrokerMetadataSnapshotter:66)
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
@dengziming since you raised the JIRA ticket for KAFKA-14073, would please review this PR.
@dengziming I have made the changes requested. I have also made the changes in QuorumController.maybeGenerateSnapshot
to log the reason for the snapshot being generated.
maybeGenerateSnapshot
seemed to be a better place than QuorumController.SnapshotGeneratorManager.createSnapshotGenerator
since it already had logs for starting a snapshot.
Also, it seems like QuorumController
only creates a snapshot when max bytes are exceeded, I couldn't find it calling the snapshot generator for the reason metadata version changed.
Hi @dengziming, could you review this once more? I have made the requested changes.
@dengziming I have made the requested changes and added the sample logs in the description of the PR.
Got it, working on these changes. Taking a bit of time to go through the code.
Hi @jsancio, I have defined an enum in the raft module for SnapshotReason
and have used it in place of string messages that were being passed previously.
I am working on making the changes for RaftClient
and KafkaMetadataLog
. These changes will log the snapshot reason in createNewSnapshot
function of KafkaMetadataLog
.
I had two implementation queries, could you please help with them -
- How do we handle multiple reasons for starting a snapshot in an enum?
- I would also need to change the function signature of
createNewSnapshot()
inKafkaMetadataLog
to accommodate the newSnapshotReason
parameter. Below is the new signature. Would this be okay?
Optional<RawSnapshotWriter> createNewSnapshot(OffsetAndEpoch snapshotId, SnapshotReason reason);
Thanks for the changes @ashmeet13 they look good in general.
1. How do we handle multiple reasons for starting a snapshot in an enum?
Instead of using Optional[SnapshotReason]
we could use Set[SnapshotReason]
. Depending on the layer we can associate different means to this set. When returned from def shouldSnapshot()
the caller can assume that if the set is empty it means that it should not generate a snapshot.
2. I would also need to change the function signature of `createNewSnapshot()` in `KafkaMetadataLog` to accommodate the new `SnapshotReason` parameter. Below is the new signature. Would this be okay?
Optional<RawSnapshotWriter> createNewSnapshot(OffsetAndEpoch snapshotId, SnapshotReason reason);
How about:
Optional<RawSnapshotWriter> createNewSnapshot(OffsetAndEpoch snapshotId, Set<SnapshotReason> reasons);
If the set reasons
is empty then the implementation of createNewSnapshot
can assume that the reason is unknown.
What do you think?
Thanks @jsancio, this is helpful. I'll go ahead and make these changes!
Thanks @jsancio, this is helpful. I'll go ahead and make these changes!
@ashmeet13 ,
I thought about this some more and I don't think we should include the reason for the snapshot in the raft
module and types. In other words, I think we should:
- Move the
SnapshotReason
type fromraft/src/main/java/org/apache/kafka/raft
tometadata/src/main/java/org/apache/kafka/metadata/util
. - Keep the old signature for the
createNewSnapshot
. I suspect that we can undo all of the changes you made to theraft
module.
I think this now because the reasons for creating a snapshot are specific to metadata
and not raft
. We have an interest in using the raft
module and the KRaft protocol to solve other problems. I don't think we will have the same reasons for wanting to generate snapshot in those use cases.
We should still use the SnapshotReason
type and I am just concern on where it should live. Thanks and excuse my earlier misguided advice.
Hi @jsancio, just one another doubt - Does this mean we don't want to log the reason within KafkaMetadataLog
and keep the logging as it is right now within BrokerMetadataSnapshotter
and QuorumController
?
Hi @jsancio, just one another doubt - Does this mean we don't want to log the reason within
KafkaMetadataLog
and keep the logging as it is right now withinBrokerMetadataSnapshotter
andQuorumController
?
Yes. That's correct. Let's keep the logging we have now and extend the message to include the reason(s).
Hi @jsancio, I have reverted the changes made to RaftClient
and kept the logging withing BrokerMetadataSnapshotter
and QuorumController
I have two questions, would really be helpful if you could guide me on this -
-
We also take a snapshot in
org.apache.kafka.raft
within the classReplicatedCounter
under thehandleCommit
function. I am facing an issue in being able to import theSnapshotReason
from themetadata/src/main/java/org/apache/kafka/metadata/util
package. Could you guide me on how I could fix this? -
We take snapshots at two different places within
QuorumController
- One I was able to figure out the reason beingMaxBytesExceeded
. There is another functionbeginWritingSnapshot
which also initiates a snapshot - I was not able to figure out who calls this function and what would the reason be in this scenario?
Hi @dengziming @jsancio, can you help with a review on this PR. If it looks okay to merge I can clean up the PR and remove temporary comments added by me.
Thank you @dengziming for the review. Made the changes to remove the comment I had added and fixed the typo in the log message.
Thanks @dengziming for the review. @jsancio could you please have a look at this PR?
Hi @jsancio, bumping this PR up in case it got missed. Could please have a look at this?
Hi @jsancio, I have reverted the changes made to
RaftClient
and kept the logging withingBrokerMetadataSnapshotter
andQuorumController
I have two questions, would really be helpful if you could guide me on this -
1. We also take a snapshot in `org.apache.kafka.raft` within the class `ReplicatedCounter` under the `handleCommit` function. I am facing an issue in being able to import the `SnapshotReason` from the `metadata/src/main/java/org/apache/kafka/metadata/util` package. Could you guide me on how I could fix this?
ReplicatedCounter
is mainly used to test the KRaft implementation. We should probably consider moving this type to raft/src/test
.
2. We take snapshots at two different places within `QuorumController` - One I was able to figure out the reason being `MaxBytesExceeded`. There is another function `beginWritingSnapshot` which also initiates a snapshot - I was not able to figure out who calls this function and what would the reason be in this scenario?
It looks like this method is only used by tests. I think it is okay for this to have an "unknown reason" for now.
Thank you @jsancio for the review. Have made the changes.
For the following comment -
It looks like this method is only used by tests. I think it is okay for this to have an "unknown reason" for now.
Currently the function in discussion has no logging taking place hence I haven't added the UnknownReason
in. Should I update the function to log the same?
Edit - I have gone ahead and added the log the same way we are logging in the actual function within QuorumController
. I can remove it if necessary.
Hi @jsancio could you please review this PR? Sorry for the multiple changes this has required.
Hi @jsancio could you please review this PR? Sorry for the multiple changes this has required.
Thanks for the changes and excuse the delays. LGTM in general. Restarted the build as it seems the JVM 17 and Scala 2.13 configuration failed to build.
Thank you @jsancio, I see that the build for JVM 17 and JDK 2.13 has timed out again. What can be done from my end to fix this?