
feature: limits for workspaces on a shard

ntnn opened this issue 5 months ago • 5 comments

Feature Description

Shards should have limits for the number of workspaces they can hold.

@xrstf did some performance tests and presented his findings in the community call on 2025-07-03. One of the findings is that a shard becomes unusable after receiving too many workspaces due to the garbage collector being constantly triggered. This situation is non-recoverable without manually deleting workspaces in etcd.

Proposed Solution

Add two limits for workspaces on shards:

  • soft limit: When this limit is reached print log messages
  • hard limit: When this limit is reached refuse further workspaces

These limits can default to magic numbers, e.g. 450 for the soft limit and 500 for the hard limit. Alternatively, we could derive a default from an educated guess based on our experience with average resource consumption, e.g. by calculating the hard limit and setting the soft limit to 90% of the hard limit.

These limits should be configurable by the user.
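To make the idea concrete, here is a minimal sketch of what the shard configuration could look like, assuming the limits are exposed as flags; the option names, defaults and package layout are made up for illustration and are not an existing kcp API:

```go
// Hypothetical sketch only: flag names, defaults and wiring are assumptions.
package options

import "github.com/spf13/pflag"

// WorkspaceLimitOptions holds per-shard workspace limits.
type WorkspaceLimitOptions struct {
	// SoftLimit: above this count the shard logs warnings and is
	// deprioritized by the front-proxy. 0 disables the soft limit.
	SoftLimit int
	// HardLimit: above this count the shard refuses new workspaces.
	// 0 disables the hard limit.
	HardLimit int
}

// NewWorkspaceLimitOptions returns the magic-number defaults from above.
func NewWorkspaceLimitOptions() *WorkspaceLimitOptions {
	return &WorkspaceLimitOptions{SoftLimit: 450, HardLimit: 500}
}

// AddFlags registers the (hypothetical) user-facing flags.
func (o *WorkspaceLimitOptions) AddFlags(fs *pflag.FlagSet) {
	fs.IntVar(&o.SoftLimit, "workspace-soft-limit", o.SoftLimit,
		"Number of workspaces after which this shard logs warnings and is deprioritized for scheduling. 0 disables the soft limit.")
	fs.IntVar(&o.HardLimit, "workspace-hard-limit", o.HardLimit,
		"Number of workspaces after which this shard refuses new workspaces. 0 disables the hard limit.")
}
```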

When the front-proxy chooses a shard to deploy a workspace on, shards under the soft limit should be preferred. When no shard under the soft limit is available, any shard under the hard limit should be used. If no shard under the hard limit is available, workspace creation should be refused (possibly allow overriding this with the kcp plugin via --force?).

If a user sets a limit to 0 it should be disabled: if the soft limit is set to 0 it is not used when selecting a shard to deploy a workspace on and no messages are logged. If the hard limit is set to 0 the shard could (technically) receive unlimited workspaces and brick itself. Disabling limits is important to allow users to perform stress testing to find the limits of their setup.
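A minimal sketch of how the front-proxy could apply these rules when picking a shard, assuming it knows each shard's current workspace count and limits; the Shard type and the selection function are hypothetical, not kcp's real scheduling code:

```go
// Hypothetical sketch of front-proxy shard selection under soft/hard limits.
package scheduling

import (
	"errors"
	"math/rand"
)

// Shard is an assumed view of a shard's current load and configured limits.
type Shard struct {
	Name       string
	Workspaces int // current number of workspaces on the shard
	SoftLimit  int // 0 = disabled
	HardLimit  int // 0 = disabled
}

var errNoCapacity = errors.New("no shard below its hard limit; refusing workspace creation")

// pickShard prefers shards below their soft limit, falls back to shards below
// their hard limit, and otherwise refuses the workspace.
func pickShard(shards []Shard) (Shard, error) {
	var underSoft, underHard []Shard
	for _, s := range shards {
		if s.HardLimit > 0 && s.Workspaces >= s.HardLimit {
			continue // hard limit reached: refuse further workspaces here
		}
		underHard = append(underHard, s)
		// A soft limit of 0 is disabled and does not influence selection.
		if s.SoftLimit == 0 || s.Workspaces < s.SoftLimit {
			underSoft = append(underSoft, s)
		}
	}
	if len(underSoft) > 0 {
		return underSoft[rand.Intn(len(underSoft))], nil
	}
	if len(underHard) > 0 {
		return underHard[rand.Intn(len(underHard))], nil
	}
	return Shard{}, errNoCapacity
}
```

Picking randomly among the candidates is just a placeholder here; how workspaces are actually spread across the eligible shards is a separate question (see the discussion about even distribution further down).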

Alternative Solutions

  1. Only soft limit

When the front-proxy chooses a shard to deploy a workspace on, shards under the soft limit are preferred.

  2. Prevent inoperability due to resource exhaustion/starvation altogether

This would overall be the better solution, as limits can always be set to erroneous values.

Want to contribute?

  • [ ] I would like to work on this issue.

Additional Context

No response

ntnn avatar Jul 03 '25 15:07 ntnn

Personally I'd greatly prefer soft and hard limits: soft limits as a canary to warn users that they need to scale, and hard limits to force users to take action by either scaling or increasing the limits.

Users with large deployments will have automated workspace creation, so they will only see logged messages if they know to look for them in their logging solution in the first place.

ntnn avatar Jul 03 '25 16:07 ntnn

I was also wondering how a scenario would work where you start with a rootshard, then notice that it's dangerously full, and then add a 2nd shard. The front-proxy will still try to evenly distribute new workspaces across all shards, so the already full rootshard will still grow at 50% speed. And with every additional shard this problem would get worse, no?

xrstf avatar Jul 04 '25 14:07 xrstf

If the root shard is under the soft limit it would continue to grow at 50% speed until it reaches the soft limit, yes. If we are already touching the way workspaces are assigned to shards, we could also ensure that they are distributed evenly.

Although "evenly" isn't easy to define - it really depends on how "busy" a workspace is.

I was also thinking about whether we should plan for a "rebalancing" of workspaces - imagine an installation with three shards that should be scaled to ten shards with fewer resources each. You would "lock" the old three shards for distribution, add the ten new shards and then trigger rebalancing, migrating workspaces to the new shards.

If we do something like that the limits could be used e.g. like this:

|     | hard limit | soft limit |
| --- | --- | --- |
| > 0 | maximum up to this many workspaces | preferred for workspaces until this limit |
| 0   | does not allow workspaces | never preferred for workspaces |
| < 0 | limit disabled | limit disabled |

This way, if a shard should be drained of workspaces, its hard limit could be set to 0, and shards with a soft limit of 0 could act as an overflow.
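A rough sketch of how these extended semantics could be checked, reusing a hypothetical Shard type with per-shard counts; negative values disable a limit, a hard limit of 0 drains the shard, and a soft limit of 0 marks it as overflow-only:

```go
// Hypothetical helpers implementing the semantics from the table above;
// the Shard type and field names are assumptions.
package scheduling

type Shard struct {
	Workspaces int
	SoftLimit  int // <0 disabled, 0 never preferred (overflow), >0 preferred until reached
	HardLimit  int // <0 disabled, 0 drained, >0 maximum
}

// acceptsWorkspaces reports whether a shard may receive new workspaces at all.
func acceptsWorkspaces(s Shard) bool {
	switch {
	case s.HardLimit < 0:
		return true // hard limit disabled
	case s.HardLimit == 0:
		return false // drained: no new workspaces
	default:
		return s.Workspaces < s.HardLimit
	}
}

// preferred reports whether a shard should be preferred for new workspaces.
func preferred(s Shard) bool {
	switch {
	case s.SoftLimit < 0:
		return true // soft limit disabled
	case s.SoftLimit == 0:
		return false // overflow only, never preferred
	default:
		return s.Workspaces < s.SoftLimit
	}
}
```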

ntnn avatar Jul 04 '25 15:07 ntnn

> Although "evenly" isn't easy to define - it really depends on how "busy" a workspace is.

I'd however start with the soft/hard limits and think about balancing later.

For that we'd need something similar to the CPU requests for pods, where the user just makes an educated guess about how busy a workspace is and adjusts it as they go along.
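A rough sketch of what such an educated guess could look like, e.g. as a per-workspace weight that gets summed up per shard; the annotation key, the default weight of 1 and the idea of expressing limits in weight units are all assumptions, not an existing kcp API:

```go
// Hypothetical sketch of weighting workspaces similarly to pod CPU requests.
package scheduling

import "strconv"

// weightAnnotation is a made-up annotation key for the user's busyness guess.
const weightAnnotation = "example.kcp.io/workspace-weight"

// workspaceWeight returns the user's educated guess for how "busy" a
// workspace is, defaulting to 1 when no annotation is set or it cannot be parsed.
func workspaceWeight(annotations map[string]string) int {
	if v, ok := annotations[weightAnnotation]; ok {
		if w, err := strconv.Atoi(v); err == nil && w > 0 {
			return w
		}
	}
	return 1
}

// shardLoad sums the weights of all workspaces on a shard; soft/hard limits
// would then be expressed in weight units instead of raw workspace counts.
func shardLoad(workspaceAnnotations []map[string]string) int {
	total := 0
	for _, a := range workspaceAnnotations {
		total += workspaceWeight(a)
	}
	return total
}
```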

ntnn avatar Jul 04 '25 15:07 ntnn

I'm also wondering whether it wouldn't make more sense to build a slightly modified, multicluster-aware version of the controllers we use in the k/k fork. That would do away with the memory requirement of 3 MB per workspace (probably :D). I think the limits would still be a good idea, but we wouldn't have to provide defaults if an external etcd is used.

ntnn avatar Jul 09 '25 08:07 ntnn