kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-17042: The migration docs should remind users to set "broker.id.generation.enable" when adding broker.id

Open frankvicky opened this issue 1 year ago • 1 comments

In the section: Enter Migration Mode on the Brokers

It requires users to add broker.id, but this can produce an error: "broker.id must be greater than or equal to -1 and not greater than reserved.broker.max.id". This error is caused by the ZooKeeper broker using a generated broker ID.

As this phase is temporary, a simple solution is to remind users to add broker.id.generation.enable=false if the ZooKeeper broker is using a generated broker ID.

Committer Checklist (excluded from commit message)

  • [ ] Verify design and implementation
  • [ ] Verify test coverage and CI build status
  • [ ] Verify documentation (including upgrade notes)

frankvicky avatar Jun 30 '24 11:06 frankvicky

Hi @showuon and @chia7712 I have add some description for broker.id.generation.enable=false, PTAL 😃

frankvicky avatar Jul 01 '24 13:07 frankvicky

Hi @chia7712 , I have do some changes based on feedback, PTAL 😸

frankvicky avatar Jul 01 '24 16:07 frankvicky

I think the main question I have is, what does this relate to migration? When doing ZK migration, that means, the ZK broker is already up and healthy, before starting migrating to KRaft. And when entering in the Enter Migration Mode on the Brokers section, we ask users to add the following 4 configs:

  1. controller.quorum.voters
  2. controller.listener.names
  3. The controller.listener.name should also be added to listener.security.property.map
  4. zookeeper.metadata.migration.enable

Thus, I don't think it will cause the error as described in the JIRA, right? I still don't understand in what circumstance this issue will happen. It might need more explanation. Thanks.

showuon avatar Jul 02 '24 02:07 showuon

Thus, I don't think it will cause the error as described in the JIRA, right? I still don't understand in what circumstance this issue will happen. It might need more explanation. Thanks.

thanks for this response, and it inspires me to dig-in the root cause. The issue happens when the zk broker is using generated broker id and it is in migrating.

  1. if we define the broker id in config file, the broker id is viewed as invalid since it is in the reserved range (when broker.id.generation.enable=true)
  2. if we don't define the broker id in config file, the broker id will be evaluated according to meta.properties [0] . However, we forgot to update node.id and so kraft client will use "node.id=-1" [1] to build and it results in error ..

[0] https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/server/KafkaServer.scala#L261 [1] https://github.com/apache/kafka/blob/e55c28c60b0f021b12c93fa04e360ce7a2e0a5ac/core/src/main/scala/kafka/raft/RaftManager.scala#L230

In short, we don't consider the use case of "generated broker id (the broker id is omitted)" + "zk migration". It seems we need two fixes:

  1. update docs to explain the possible error when you define the broker id for the zk broker which is using broker.id.generation.enable
  2. in order to minimize the changes, we can update both broker.id and node.id together. The side effect is KafkaConfig#nodeId will be a mutable variable.

@showuon WDYT?

chia7712 avatar Jul 02 '24 18:07 chia7712

  1. update docs to explain the possible error when you define the broker id for the zk broker which is using broker.id.generation.enable

If my understanding is correct, the possible error is the broker id will be ignored and the generated ID might exceed the range of max id, which cause the error reported in JIRA. Is that right?

  1. in order to minimize the changes, we can update both broker.id and node.id together. The side effect is KafkaConfig#nodeId will be a mutable variable.

IMO, this should be a bug. We should take the generated ID into node.id. WDYT?

showuon avatar Jul 03 '24 00:07 showuon

@frankvicky , any thoughts about it? I'd also like to hear since you are the author of this PR.

showuon avatar Jul 03 '24 00:07 showuon

If my understanding is correct, the possible error is the broker id will be ignored and the generated ID might exceed the range of max id, which cause the error reported in JIRA. Is that right?

Yep, ignoring the broker id can pass the reserved ids check but it fails on the default node.id. Explicitly defining the "broker.id=generated id" can update node.id to run Kraft components in zk broker but it fails on reserved ids check.

We should take the generated ID into node.id.

Yes. Specifically, We can define node.id too in updating broker.id manually no matter how we get the broker.id.

chia7712 avatar Jul 03 '24 02:07 chia7712

@frankvicky , any thoughts about it? I'd also like to hear since you are the author of this PR.

Hi @showuon ! The reason why broker.id.generation.enable might need to be disabled has been more precisely explained by @chia7712 compared to my previous explanation.

Regarding the second point from @chia7712 , I'm thinking the potential drawbacks of allowing node.id to be mutable.

IMHO, could these changes lead to inconsistencies in cluster coordination and communication? Might they also cause any unpredictable behavior?

frankvicky avatar Jul 03 '24 02:07 frankvicky

Yep, ignoring the broker id can pass the reserved ids check but it fails on the default node.id.

Sorry, I don't understand what you mean here. Enabling the broker.id.generation.enable, we'll Ignore the broker ID config. "But it fails on the default node.id" <-- what does it mean?

Explicitly defining the "broker.id=generated id" can update node.id to run Kraft components in zk broker but it fails on reserved ids check.

"but it fails on reserved ids check" <-- what does it mean? If the ID is successfully generated, it should already be within the reserved range, why do we need another check?

Regarding the second point from @chia7712 , I'm thinking the potential drawbacks of allowing node.id to be mutable.

Currently, if node.id is not set, but broker.id is set, we'll set node.id=broker.id. So, we can add, if broker.id.generation.enable is enabled, and broker.id, node.id not set, we set them. That should fix the problem in this JIRA, right?

IMHO, could these changes lead to inconsistencies in cluster coordination and communication? Might they also cause any unpredictable behavior?

Could you explain more about the inconsistencies in cluster coordination and communication situation? I might miss something. Thanks.

showuon avatar Jul 03 '24 06:07 showuon

I think my main point is, the JIRA said:

it requires users to add "broker.id", but it can produces error "broker.id must be greater than or equal to -1 and not greater than reserved.broker.max.id" too. 
That is caused by the zk broker is using a generated broker id.

the root cause of the issue is the generated ID doesn't pass to node.id, which causes this error when startup, is that correct? If so, why cannot we fix the issue by setting node.id = generated ID, if not set? If not, please kindly let me know why this error will appear? Maybe a reproduce step is helpful. Thanks.

showuon avatar Jul 03 '24 06:07 showuon

But it fails on the default node.id" <-- what does it mean? If the ID is successfully generated, it should already be within the reserved range, why do we need another check?

The story starts with a zk broker having a generated broker id.

Normally, it does not need to define the broker id in the config file. However, that can't work in migration aince Kraft class needs node id and node id is not initialized with (generated) broker id.

If we add the (generated) broker id to config file to initialize both ids, it can't pass the check due to reservation rules.

the root cause of the issue is the generated ID doesn't pass to node.id, which causes this error when startup, is that correct?

You are right and that is the root cause.

? If so, why cannot we fix the issue by setting node.id = generated ID, if not set?

That can fix the issue. The concern is that the node.id will be a mutable variable rather than a value from configs. That opens a dangerous door that enable everyone to change it in everywhere.

Maybe another solution is we can skip the id value check if it is in zk migration. The side effect is that zk broker needs to define broker id in config file for migration even though generated id is enabled.

chia7712 avatar Jul 03 '24 09:07 chia7712

Could you explain more about the inconsistencies in cluster coordination and communication situation? I might miss something. Thanks.

If I remember correctly, KafkaConfig can be changed externally. If node.id becomes a variable, users might unintentionally or intentionally change node.id, which is clearly not very secure.

frankvicky avatar Jul 03 '24 09:07 frankvicky

Thanks for the response. So I think our problem to solve is, when running migration in Enter Migration Mode on the Brokers step, the ZK node "MUST" set broker.id, to avoid node.id set to -1 and fail to startup. @frankvicky , could you please focus on this issue to improve the doc? Thanks.

showuon avatar Jul 03 '24 10:07 showuon

Thanks for @showuon to clarify the root cause. That is a great offline discussion 😃

The inconsistency between node.id and broker.id (when generated id is enabled) can be addressed in another issue. I will file it later

chia7712 avatar Jul 03 '24 10:07 chia7712

Hi @showuon , @chia7712 I have rewrite the description of broker.id to explain why we need this config, PTAL

frankvicky avatar Jul 03 '24 12:07 frankvicky

Hi @showuon I have simplified the description based on feedback, PTAL

frankvicky avatar Jul 03 '24 12:07 frankvicky

The inconsistency between node.id and broker.id (when generated id is enabled) can be addressed in another issue. I will file it later

https://issues.apache.org/jira/browse/KAFKA-17077

chia7712 avatar Jul 03 '24 19:07 chia7712

Hi @showuon I have adjust the description based on your feedback, PTAL

frankvicky avatar Jul 04 '24 04:07 frankvicky

Hi @chia7712 , @showuon I have added a reminder about potential failures, PTAL 🐧

frankvicky avatar Jul 04 '24 11:07 frankvicky