KAFKA-19851 Delete dynamic configs that were removed by Kafka
Problem
https://issues.apache.org/jira/browse/KAFKA-19851
When upgrading from Kafka 3.x to 4.0, the metadata log may contain dynamic configurations that were removed in 4.0 (e.g., `message.format.version` per KIP-724). These removed configs cause an InvalidConfigurationException whenever users attempt to modify any configuration, because validation checks all existing configs, including the removed ones.
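Roughly, the failure looks like this (a simplified sketch, not the actual controller code; the class and the known-name set are illustrative stand-ins):

```java
import java.util.Map;
import java.util.Set;

import org.apache.kafka.common.errors.InvalidConfigurationException;

// Sketch of the failure mode: validation runs over the merged state
// (existing configs plus the requested change), so a leftover key that
// 4.0 no longer defines fails the whole request, even though the user
// never touched it.
final class UpgradeFailureSketch {
    static void validateAll(Map<String, String> mergedConfigs, Set<String> knownNames) {
        for (String name : mergedConfigs.keySet()) {
            if (!knownNames.contains(name)) {
                // e.g. message.format.version written by a 3.x broker lands here
                throw new InvalidConfigurationException("Unknown config: " + name);
            }
        }
    }
}
```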
Implementation
Implement whitelist-based filtering to automatically remove invalid/removed configurations during metadata image building and replay:
- `KafkaConfigSchema.validConfigNames()` - Returns the whitelist of valid, non-internal config names per resource type
- Filter in `ConfigurationsDelta` - Filter invalid configs when replaying `ConfigRecord` and building the final image
- Filter in `ConfigurationControlManager` - Filter invalid configs during replay to keep controller state clean
- Add `GROUP` and `CLIENT_METRICS` ConfigDefs to `KafkaRaftServer` for a complete whitelist
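A minimal sketch of the filtering described above, assuming the names from the list; the method body is a simplified illustration, not the actual patch:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the whitelist filter applied while building the image and
// replaying ConfigRecords. The validNames argument stands in for
// KafkaConfigSchema.validConfigNames(resourceType).
final class ConfigFilterSketch {
    static Map<String, String> filterInvalid(Map<String, String> configs, Set<String> validNames) {
        Map<String, String> filtered = new HashMap<>();
        for (Map.Entry<String, String> entry : configs.entrySet()) {
            // Drop anything the current software version no longer defines,
            // e.g. message.format.version after KIP-724.
            if (validNames.contains(entry.getKey())) {
                filtered.put(entry.getKey(), entry.getValue());
            }
        }
        return filtered;
    }
}
```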
Changes
- `KafkaConfigSchema`: Added `validConfigNames()` method
- `ConfigurationsDelta`/`ConfigurationDelta`: Filter invalid configs in `replay()` and `apply()`
- `ConfigurationControlManager`: Filter invalid configs in `replay()`
- `KafkaRaftServer`: Added GROUP and CLIENT_METRICS ConfigDefs
- Tests: Added unit tests verifying filtering behavior
Testing
- Removed configs are filtered from base images and during replay
- Explicitly setting a removed config still triggers `InvalidConfigurationException`
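As an illustration, a unit test along these lines (a hedged sketch: `replayWithFiltering` is a hypothetical helper, not the real test utility from the patch):

```java
import static org.junit.jupiter.api.Assertions.assertFalse;

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.junit.jupiter.api.Test;

// Sketch of a filtering test. replayWithFiltering(...) stands in for
// replaying ConfigRecords through ConfigurationsDelta.
class RemovedConfigFilteringSketchTest {

    private static Map<String, String> replayWithFiltering(Map<String, String> records,
                                                           Set<String> validNames) {
        Map<String, String> image = new HashMap<>();
        records.forEach((name, value) -> {
            if (validNames.contains(name)) {
                image.put(name, value);
            }
        });
        return image;
    }

    @Test
    void removedConfigIsDroppedDuringReplay() {
        Map<String, String> image = replayWithFiltering(
            Map.of("retention.ms", "604800000",
                   "message.format.version", "2.8"),    // removed in 4.0 (KIP-724)
            Set.of("retention.ms", "cleanup.policy"));  // whitelist for the topic resource
        assertFalse(image.containsKey("message.format.version"));
    }
}
```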
I think I agree that not deleting the unknown configs might have fewer unintended side effects. About the downgrade scenario, I would assume most cases where we introduce a new topic config would involve a new MV, in which case we don't guarantee lossless downgrades (and this is beside the point that MV downgrade isn't supported yet). We don't technically gate topic configs on MV or major versions, so it is quite possible we lose information unexpectedly on a downgrade.
The simplest approach is to not validate dynamic configs that are not known to Kafka.
Seems similar to my question on whether we can just avoid validating the existing configs and prevent new invalid configs from being added. I don't necessarily agree with allowing a user to add a bad config - this could become a vulnerability if we don't have a cap on the number of configs.
> avoid validating the existing configs and prevent new invalid configs from being added
I agree that is a better way.
If the controller is to keep returning an error like it does today in this state, it should return an error that lists all the now-invalid configs so it is straightforward for the user to clean them up.
That way doesn't fix our current issue. The problem is that whenever users add or modify configurations, we throw an exception if there are any invalid configs. Users have to manually remove them all, which is tedious and exactly what we want to improve. Simply informing them about the invalid configs doesn't really simplify the process, because they still need to clean them up one by one. And they would still lose those configs permanently in the end.
So based on the discussion above, we'd better stop validating existing configs and only prevent users from adding new invalid configs.
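In code, that semantic might look roughly like this (a hedged sketch, not the actual ConfigurationControlManager logic): only the entries in the incoming request are checked, and pre-existing unknown keys are left alone.

```java
import java.util.Map;
import java.util.Set;

import org.apache.kafka.common.errors.InvalidConfigurationException;

// Sketch: validate only the entries carried by the ALTER_CONFIGS request.
// Pre-existing unknown configs in the metadata are tolerated; adding or
// modifying a NEW unknown config is still rejected.
final class AlterConfigSketch {
    static void validateRequestOnly(Map<String, String> requestedChanges, Set<String> validNames) {
        for (String name : requestedChanges.keySet()) {
            if (!validNames.contains(name)) {
                throw new InvalidConfigurationException(
                    "Unknown config " + name + " cannot be added or modified");
            }
        }
        // Existing config state is deliberately NOT re-validated here, so
        // leftover 3.x configs no longer block unrelated changes.
    }
}
```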
My only concern is: Is it really the right approach to let Kafka tolerate configurations that should no longer exist in the metadata? If we ever introduce new logic that handles existing configs in the future, we might have to keep adding code paths that explicitly ignore these existing but invalid configs. That seems like it could gradually accumulate technical debt. If we want the metadata to be clean without losing those configs permanently, could we introduce a new config called REMOVED_CONFIG and move all those configs there?
@ahuang98 @kevin-wu24
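To make the REMOVED_CONFIG idea concrete, a hedged sketch (the mechanism is only a proposal; the `removed.configs` holder name and the encoding are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of the REMOVED_CONFIG idea: instead of deleting unknown configs,
// park them under a dedicated holder key so the metadata stays clean
// without losing the values permanently.
final class RemovedConfigSketch {
    static Map<String, String> quarantine(Map<String, String> configs, Set<String> validNames) {
        Map<String, String> result = new HashMap<>();
        StringBuilder removed = new StringBuilder();
        configs.forEach((name, value) -> {
            if (validNames.contains(name)) {
                result.put(name, value);
            } else {
                if (removed.length() > 0) {
                    removed.append(',');
                }
                removed.append(name).append('=').append(value);
            }
        });
        if (removed.length() > 0) {
            // "removed.configs" is a made-up name for the proposed holder.
            result.put("removed.configs", removed.toString());
        }
        return result;
    }
}
```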
Thanks for the discussion @ahuang98 @0xffff-zhiyan:
> I don't necessarily agree with allowing a user to add a bad config - this could become a vulnerability if we don't have a cap on the number of configs
I think I misunderstood your original comment. I agree that if we ignore the existing config metadata state and only validate what is contained in the ALTER_CONFIG request, a given version of Kafka's dynamic config will be valid. When going between major versions, removing a config from the source code will not invalidate the existing dynamic config state on the new version of Kafka, and ALTER_CONFIG can still complete. This matches how the static .properties config is validated by Kafka.
> My only concern is: Is it really the right approach to let Kafka tolerate configurations that should no longer exist in the metadata? If we ever introduce new logic that handles existing configs in the future, we might have to keep adding code paths that explicitly ignore these existing but invalid configs
Dynamic configs that are not known by Kafka, just like static configs, shouldn't invalidate the entire config. In this case they do, because ALTER_CONFIG will fail. The argument here is that we should not have been validating the existing dynamic config in the first place, since what counts as a "valid" (dynamic OR static) configuration depends only on the software version of Kafka currently running. If I change software versions, fields in my static .properties file can go from valid to unknown by Kafka, and loading those unknown configs into KafkaConfig does not throw an exception because they are ignored. We should apply this semantic to the dynamic configuration too.
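For comparison, the static .properties semantic described above, as a sketch with plain java.util.Properties (the actual Kafka loading path differs in its details but similarly ignores unknown keys rather than throwing):

```java
import java.util.Properties;
import java.util.Set;

// Sketch of the static-config semantic: keys in the .properties file
// that the running software version does not know are ignored (Kafka
// logs them as unused) instead of failing startup.
final class StaticConfigSketch {
    static Properties effectiveConfig(Properties raw, Set<String> knownNames) {
        Properties effective = new Properties();
        for (String name : raw.stringPropertyNames()) {
            if (knownNames.contains(name)) {
                effective.setProperty(name, raw.getProperty(name));
            }
            // else: unknown to this software version, silently ignored
        }
        return effective;
    }
}
```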