Throttle topic administrative functions
This PR implements KIP-599 (Throttle topic administrative functions), namely the `CreateTopics`, `DeleteTopics` and `CreatePartitions` requests.
Important to note: this KIP doesn't attempt to constrain the number of global partition mutations cluster-wide. The KIP enforces the quota at the client level. We could have deviated from the KIP in this respect, but there is already ongoing work and planning within the multi-tenancy effort to enforce such limits, so we thought it was best to implement the KIP as it originally stands.
Throttling works by using a modified token-bucket algorithm where the tokens represent the number of partition mutations made by one of the three aforementioned requests. The algorithm is modified to accept a given burst, which allows a burst of several mutations with a small or medium number of partitions within the same request to proceed. Subsequent requests will observe an exceeded quota and will receive the newly implemented error code `THROTTLING_QUOTA_EXCEEDED` (if the client is new enough) along with the total throttle time in milliseconds in the response.
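As a rough illustration only (not the actual redpanda code; the class name and interface here are assumptions), a burst-capable token bucket along these lines might look like:

```cpp
#include <algorithm>
#include <chrono>
#include <optional>

class token_bucket {
public:
    using clock = std::chrono::steady_clock;

    token_bucket(double rate, double burst)
      : _rate(rate)
      , _burst(burst)
      , _tokens(burst)
      , _last(clock::now()) {}

    // Withdraw `n` tokens (one token per partition mutation). Returns the
    // suggested throttle delay, or std::nullopt if the request is within quota.
    std::optional<std::chrono::milliseconds> record(double n) {
        refill();
        _tokens -= n; // allow the balance to go negative so a burst can pass
        if (_tokens >= 0) {
            return std::nullopt;
        }
        // Time until the deficit is paid back at the configured rate.
        auto delay_ms = static_cast<long>((-_tokens / _rate) * 1000.0);
        return std::chrono::milliseconds(delay_ms);
    }

private:
    void refill() {
        auto now = clock::now();
        std::chrono::duration<double> elapsed = now - _last;
        _last = now;
        _tokens = std::min(_burst, _tokens + elapsed.count() * _rate);
    }

    double _rate;   // tokens (partition mutations) per second
    double _burst;  // maximum accumulated tokens
    double _tokens; // current balance; may go negative after a burst
    clock::time_point _last;
};
```

Letting the balance dip below zero is what lets a burst of mutations in one request through, while the size of the resulting deficit translates directly into the throttle time returned to the client.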
Fixes: #3694
Changes in force-push
- Implemented Ben's feedback, mostly addressing C++ issues
Changes in force-push
- Removed unnecessary `cluster` error code
- Fixed a bug with the handling of `throttling_quota_exceeded` errors, per the KIP's explanation. With these changes, the 3 supported requests will now respond with `throttling_quota_exceeded` for supported requests AND where the version of the request is new enough. In cases where the client would not understand this error code, the request is handled with the throttle applied (see the sketch below).
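To make that version gate concrete, here is a hedged sketch; the `min_version_with_quota_error` constant, the enum value and the result type are placeholders rather than the real handler code:

```cpp
#include <chrono>
#include <cstdint>

enum class error_code : int16_t {
    none = 0,
    // Stands in for KIP-599's THROTTLING_QUOTA_EXCEEDED; the numeric value
    // here is a placeholder, not the real Kafka protocol code.
    throttling_quota_exceeded = 1,
};

struct topic_result {
    error_code ec{error_code::none};
    std::chrono::milliseconds throttle_ms{0};
};

// Hypothetical minimum request version at which a client understands the
// new error code (the real cutoffs are per-API and defined by the KIP).
constexpr int16_t min_version_with_quota_error = 5;

topic_result on_quota_exceeded(
  int16_t request_version, std::chrono::milliseconds delay) {
    topic_result r;
    r.throttle_ms = delay; // the throttle time is always reported
    if (request_version >= min_version_with_quota_error) {
        // New-enough clients get the explicit, retryable error code.
        r.ec = error_code::throttling_quota_exceeded;
    }
    // Older clients receive no error code, but the broker still delays the
    // response by `delay` before sending it.
    return r;
}
```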
Changes in force-push && force-push
- Rebased dev
Changes in force-push
- Modified how `throttling_quota_exceeded` is interpreted. Previously all partition mutations were accounted per request and the entire allotment of mutations was allowed to proceed.
- Now each mutation within a single create_topics, delete_topics or create_partitions request is evaluated one at a time; all mutations that exceed the quota are rejected (see the sketch below).
- The modified behavior means that if a single request contains X partition mutations, some number < X may be allowed to proceed while the remaining will be returned to the client with a retryable error.
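A sketch of that per-mutation evaluation, reusing the `token_bucket` and `error_code` sketches above; the request and result types are simplified stand-ins for the real Kafka schema types:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-ins for the Kafka create_topics schema types.
struct new_topic {
    std::string name;
    int32_t partition_count;
};

struct creatable_topic_result {
    std::string name;
    error_code ec{error_code::none};
};

// Charge each topic's partition mutations against the quota one at a time,
// so a single request may be partially admitted and partially rejected.
std::vector<creatable_topic_result> evaluate_create_topics(
  token_bucket& quota, const std::vector<new_topic>& topics) {
    std::vector<creatable_topic_result> results;
    results.reserve(topics.size());
    for (const auto& t : topics) {
        auto delay = quota.record(t.partition_count);
        if (delay) {
            // Over quota: reject just this mutation with the retryable error.
            results.push_back({t.name, error_code::throttling_quota_exceeded});
        } else {
            // Within quota: this topic proceeds to the actual creation path.
            results.push_back({t.name, error_code::none});
        }
    }
    return results;
}
```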
Changes in force-push
- Fixed a bug with flex protocol parsing of types like `[]string`
Release Notes
- Adds support for throttling of topic administration functions (create_topics, create_partitions, and delete_topics requests), by way of a new config property `partition_mutation_rate`. When this rate is exceeded per user/shard, the client will be throttled appropriately (see the sketch below).
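For illustration, the per-user/per-shard bookkeeping could be keyed roughly like this (reusing the `token_bucket` sketch above; only the `partition_mutation_rate` name comes from the release note, everything else is assumed):

```cpp
#include <chrono>
#include <optional>
#include <string>
#include <unordered_map>

// One of these would live on each shard; there is no cross-shard
// coordination, matching the per-shard+per-client enforcement discussed
// in the review comments below.
class mutation_quota_manager {
public:
    explicit mutation_quota_manager(double partition_mutation_rate)
      : _rate(partition_mutation_rate) {}

    // Charge `mutations` against the bucket owned by `client_id`, creating
    // the bucket lazily on first use. Returns a throttle delay if exceeded.
    std::optional<std::chrono::milliseconds>
    record(const std::string& client_id, double mutations) {
        // Burst set equal to the rate purely for illustration.
        auto it = _buckets.try_emplace(client_id, _rate, _rate).first;
        return it->second.record(mutations);
    }

private:
    double _rate;
    std::unordered_map<std::string, token_bucket> _buckets;
};
```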
Missing release notes section in PR description (should describe the feature and specify the new config property name)
> Important to note: this KIP doesn't attempt to constrain the number of global partition mutations cluster-wide. The KIP enforces the quota at the client level.
The KIP enforces on a per-node+per-client basis, right? Whereas this PR is enforcing on a per-shard+per-client basis, which is quite a bit looser. From a superficial reading, I think this only protects us from aggressive clients if the client is using a single connection: if they opened a bunch of connections (that might hit different shards), the per-client limits are less meaningful.
Per-shard enforcement of quotas is fine if the resource we're protecting is also per-shard (cpu time, memory), but in this instance if we're trying to protect our central controller log, I'm not sure this is going to give us as much confidence as we would like. Maybe the enforcement should be done on the controller leader instead? Or perhaps even in both places: a "friendly" check at the kafka level that does this sleep-based throttling, plus a "hard" check on the controller leader that would actually send the quota_exceeded errors if the controller log is being written to too fast.
Good points @jcsp, true that a client could jump around to other shards to end up exceeding the quota, but they could do that now anyway to sidestep the byte-level (and soon to be implemented time-based) quotas. Agree with your points though.
Doing this enforcement on the leader was considered, however it would be impossible to return the estimated throttle response time without heavily modifying the API, which is one reason I went this route. Maybe @dotnwat can chime in here, but I think the idea was to have the multi-tenancy effort provide hard controller partition-create rate quotas, and KIP-599 would be our effort at providing some throttling until we have a better solution implemented.
~~Going to draft this as I'm realizing it will really depend on KIP-482 to implement correctly. This is due to the unfortunate fact that the version bumps required for the requests that KIP-599 designates to return the new `throttling_quota_exceeded` error are greater than the version at which the `flexibleVersion` was introduced.~~
~~We cannot support any request that has a version greater than the `flexibleVersion`. Simply attempting to would probably crash, as even the parsing in the request handler differs for versions older than the `flexibleVersion`.~~
This work has been completed and exists in dev as of this update
Rebased dev
> Doing this enforcement on the leader was considered, however it would be impossible to return the estimated throttle response time without heavily modifying the API, which is one reason I went this route.
why is it impossible? if enforcement is done at the controller leader then it has some idea about the rate of requests.
That comment is pretty old so I don't even remember what I was referring to, but doesn't it already work this way? Since `create_topics`, `delete_topics` and `create_partitions` all are executed on the controller leader, the current running totals are representative of the cluster-wide partition mutation rate.
I think that gets thrown out the window though when the controller changes, since that running total is in-memory data only stored on a single node.
@graphcareful @dotnwat @jcsp @mmaslankaprv @ZeDRoman how does this relate to https://github.com/redpanda-data/core-internal/issues/37?
~Overlapping? Identical? Conflicting? How much coordination do we need here?~ switching to slack for discussion.
@dotnwat in the KIP-599 documentation, it's called `controller_mutations_rate`. FWIW I'm up for changing it, I'm not married to the name I've chosen.
@graphcareful me neither. can you sync with @ZeDRoman about config names? i think that name is fine but maybe it ends up being a name that differentiates control at the controller vs control at the edge?
Restarted CI
In the previous force-push, I addressed concerns about the lifetime of references on the stack within particular lambdas, and I moved the token bucket algorithm into the `kafka/` tree so as not to confuse others with the generic one Roman is working on that will exist in `utils/`.