openwhisk
[New Scheduler] Namespace throttling can prevent specific actions within a namespace from ever running
On the old scheduler, concurrency throttling limits a namespace to x activations in the system at once, regardless of which actions they are for. If the namespace is getting 429s for concurrency on the old scheduler, the entire workload will eventually process as long as the application keeps retrying those requests.
On the new scheduler, that isn't necessarily the case. Say you have two functions that depend on one another, A -> B. A is high traffic and fans out containers up to the concurrency limit for that namespace. When B then attempts to start running, it can't process any requests: the namespace is throttled, and B didn't get any containers before the limit was hit. The A -> B workflow deadlocks because A keeps receiving requests while B can never run. Using openwhisk will now require a user to do much more fine-grained capacity planning, based on throughput calculations, to prevent this from happening. Previously the user could depend on openwhisk slowing them down without halting processing of a specific action. This capacity planning is hard to do because the scheduler doesn't always make optimal decisions about whether to fan out new containers for an action. It's best effort, so a user can't reliably plan around it to never breach the threshold.
My initial thought for a short-term fix: if namespace throttling is turned on and there is no space left in the namespace, but the action has 0 containers, allow the creation of one container to give it some throughput. Then add a safety configuration, some ratio over the concurrency limit, beyond which it will no longer create that single container for new actions, so the user can't game it. i.e. something like `if (currentContainersForNamespace > 1.5 * concurrencyLimit) create 0 containers else create 1 container`. Maybe the action then gets action throttled, which is okay, but at least it's able to process eventually, and this prevents deadlocking of interdependent functions within a namespace.
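To make the short-term proposal concrete, here is a rough sketch of the decision in Python. It is not the scheduler's actual code: `NamespaceState`, `containers_to_create`, and `overcommit_ratio` are hypothetical names, and the 1.5 default is just the example ratio from above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NamespaceState:
    """Hypothetical view of one namespace as seen by the scheduler."""
    concurrency_limit: int
    containers_per_action: Dict[str, int] = field(default_factory=dict)

    @property
    def total_containers(self) -> int:
        return sum(self.containers_per_action.values())


def containers_to_create(state: NamespaceState, action: str, requested: int,
                         overcommit_ratio: float = 1.5) -> int:
    """Return how many containers to create for `action`.

    Assumed rule: below the namespace limit, grant what fits. At or above
    the limit, grant exactly one container to an action that has none,
    unless the namespace already exceeds overcommit_ratio * limit.
    """
    if requested <= 0:
        return 0
    total = state.total_containers
    free = state.concurrency_limit - total
    if free > 0:
        # Normal case: there is still room under the namespace limit.
        return min(requested, free)
    starved = state.containers_per_action.get(action, 0) == 0
    if starved and total <= overcommit_ratio * state.concurrency_limit:
        # Namespace is throttled, but let the starved action make progress.
        return 1
    return 0


# A has consumed the whole namespace limit; B still gets one container.
state = NamespaceState(concurrency_limit=10, containers_per_action={"A": 10})
print(containers_to_create(state, "B", requested=3))
print(containers_to_create(state, "A", requested=3))
```

With this rule, B gets a single container even though A holds the full limit, while A itself stays throttled. The ratio caps how far a namespace can exceed its limit by registering many new actions.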
For a more long-term fix, we really should start planning an action-level concurrency limit implementation that is hierarchical between action and namespace. An action could be provisioned with some of the concurrency from its namespace's concurrency pool, guaranteeing that it will always get at least that much concurrency. The current namespace limit only really makes sense for the operator of the system. The user of the namespace should be able to better control the flow of traffic. And I think with the new scheduler, this becomes much more feasible for us to finally implement.
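As one possible shape for the hierarchical limit, the sketch below treats per-action reservations as carved out of the namespace pool, with the remainder shared by all actions. This is only an illustration of the idea under that assumption: `action_capacity`, `reserved`, and `in_use` are hypothetical names, not an existing openwhisk API.

```python
from typing import Dict


def action_capacity(namespace_limit: int, reserved: Dict[str, int],
                    in_use: Dict[str, int], action: str) -> int:
    """Return how many more containers `action` may start.

    Assumed model: each action may reserve part of the namespace limit.
    Usage up to an action's reservation counts against that reservation;
    usage beyond it counts against the shared remainder of the pool.
    """
    total_reserved = sum(reserved.values())
    if total_reserved > namespace_limit:
        raise ValueError("reservations exceed the namespace limit")

    shared_pool = namespace_limit - total_reserved
    # Containers each action runs beyond its own reservation.
    shared_used = sum(max(0, used - reserved.get(name, 0))
                      for name, used in in_use.items())
    shared_free = max(0, shared_pool - shared_used)
    # Unused part of this action's guaranteed share.
    own_free = max(0, reserved.get(action, 0) - in_use.get(action, 0))
    return own_free + shared_free


# A has filled the shared pool, but B's reserved share is untouched.
print(action_capacity(10, {"B": 2}, {"A": 8}, "B"))
print(action_capacity(10, {"B": 2}, {"A": 8}, "A"))
```

Here A can fan out only into the shared part of the pool, so B always keeps its guaranteed share, while the operator-facing namespace limit still bounds the total.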