
[proposal] improve ElasticQuota

Open eahydra opened this issue 2 years ago • 0 comments

What is your proposal:

ElasticQuota feature enhancements overview

ElasticQuota is a key feature of the Koordinator project and has been supported since the early days. It is not only compatible with the original ElasticQuota CRD, but has also undergone the following enhancements:

  • Tree Structure Management: Allows resources to be divided by organizational structure or workload type.
  • Weight Support: Allocates resource quota based on weight; quotas with higher weights receive a larger share.
  • Fairness Guarantee: Implements a resource quota allocation mechanism that is as fair as possible.
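
To make the weight and fairness ideas concrete, here is a minimal sketch of weighted fair sharing. It is not Koordinator's actual algorithm: the `Quota` type, its fields, and `distributeByWeight` are hypothetical names introduced only for illustration, and a single resource dimension is assumed.

```go
package main

import "fmt"

// Quota is a hypothetical, simplified view of one child quota
// (single resource dimension, not the real ElasticQuota type).
type Quota struct {
	Name    string
	Weight  float64 // relative weight; higher weight means a larger share
	Request float64 // amount the quota currently asks for
}

// distributeByWeight splits total among quotas in proportion to their
// weights, never giving a quota more than it requests. Capacity left over
// by quotas that are already satisfied is redistributed among the rest.
func distributeByWeight(total float64, quotas []Quota) map[string]float64 {
	alloc := make(map[string]float64, len(quotas))
	active := make([]Quota, 0, len(quotas))
	for _, q := range quotas {
		alloc[q.Name] = 0
		if q.Weight > 0 && q.Request > 0 {
			active = append(active, q)
		}
	}
	remaining := total
	for remaining > 0 && len(active) > 0 {
		weightSum := 0.0
		for _, q := range active {
			weightSum += q.Weight
		}
		var unsatisfied []Quota
		distributed := 0.0
		for _, q := range active {
			share := remaining * q.Weight / weightSum
			need := q.Request - alloc[q.Name]
			if share >= need {
				// The quota is fully satisfied; the surplus stays in the pool.
				alloc[q.Name] = q.Request
				distributed += need
			} else {
				alloc[q.Name] += share
				distributed += share
				unsatisfied = append(unsatisfied, q)
			}
		}
		remaining -= distributed
		if len(unsatisfied) == len(active) {
			break // nobody was capped, so the whole remainder was handed out
		}
		active = unsatisfied
	}
	return alloc
}

func main() {
	quotas := []Quota{
		{Name: "team-a", Weight: 1, Request: 10},
		{Name: "team-b", Weight: 1, Request: 100},
		{Name: "team-c", Weight: 2, Request: 100},
	}
	alloc := distributeByWeight(100, quotas)
	for _, q := range quotas {
		fmt.Printf("%s: %.1f\n", q.Name, alloc[q.Name])
	}
}
```

In this sketch a quota that asks for less than its weighted share keeps only what it asks for, and the surplus is shared among the remaining quotas by weight.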

Recently added features

  • NonPreemptible mechanism: Users can mark Pods as non-preemptible to ensure that their resource usage does not exceed the minimum limit (min).
  • Multi Quota Tree: Introduced via ElasticQuotaProfile, allowing the construction of new Quota Trees and defining the maximum resource usage for each tree.
  • New statistical dimensions: Adds Guarantee and Allocated statistics to better support elastic resource requirements.
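
The bounds involved in the NonPreemptible mechanism can be sketched as a small admission check. This is an illustration under stated assumptions, not the scheduler's real code: `QuotaState` and `canAdmit` are hypothetical, a single resource dimension is used, the comparisons are treated as inclusive, and non-preemptible usage is assumed to be tracked separately. The preemptible bound (runtime and max) follows the description under "Make resource request verification clear" in this proposal.

```go
package main

import "fmt"

// QuotaState is a hypothetical single-resource snapshot of an ElasticQuota.
type QuotaState struct {
	Min                int64 // guaranteed amount
	Max                int64 // hard upper limit
	Runtime            int64 // currently assigned runtime quota (may be >= Min)
	Used               int64 // usage of all Pods in the quota
	NonPreemptibleUsed int64 // usage of non-preemptible Pods only (assumed to be tracked separately)
}

// canAdmit reports whether a Pod requesting `request` fits into the quota.
// Non-preemptible Pods are bounded by Min; preemptible Pods are bounded by
// the smaller of Runtime and Max.
func canAdmit(q QuotaState, request int64, nonPreemptible bool) bool {
	if nonPreemptible {
		return q.NonPreemptibleUsed+request <= q.Min
	}
	limit := q.Runtime
	if q.Max < limit {
		limit = q.Max
	}
	return q.Used+request <= limit
}

func main() {
	q := QuotaState{Min: 10, Max: 40, Runtime: 30, Used: 25, NonPreemptibleUsed: 8}
	fmt.Println("non-preemptible, request 2:", canAdmit(q, 2, true))
	fmt.Println("non-preemptible, request 3:", canAdmit(q, 3, true))
	fmt.Println("preemptible, request 5:", canAdmit(q, 5, false))
	fmt.Println("preemptible, request 6:", canAdmit(q, 6, false))
}
```

Whether the two kinds of Pods share one usage counter or are tracked separately is one of the open questions in this proposal; the sketch picks separate tracking only to keep the example short.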

Problems to be solved and optimization directions

  1. Configurability of the fairness mechanism #1780, implemented by #1839 #1855

    • The fairness mechanism is enabled by default and currently cannot be turned off; an option is needed to accommodate special cases such as job scenarios. For example, in the Job scenario, the fairness mechanism may prevent multiple jobs from obtaining quota for resources that were originally used by a single Job.
  2. Quota Tree integration

    • Integrate Multi Quota Tree with the global default Quota Tree to ensure consistency and simplify management. The global default Quota Tree is simply a special case of a Multi Quota Tree: one that contains the resources of all nodes in the cluster.
  3. Make resource request verification clear

    • Clarify the verification logic between Pod resource requests and ElasticQuota boundaries for different scenarios. In the early design, the only check was whether the sum of the Pod's request and Used stayed below runtime and max, where runtime may be greater than or equal to min. The NonPreemptible mechanism was introduced later: the upper bound of the quota available to non-preemptible Pods is min, while the upper bound for preemptible Pods is runtime and max. There are more complex scenarios as well, such as whether non-preemptible and preemptible Pods are counted in Used together or need to be tracked separately. MultiQuotaTree introduced two further dimensions, Guarantee and Allocated. These solve the resource reservation problem in on-demand elastic scenarios, but they also affect the upper bound of a Pod's available resources. We therefore need to clarify the verification method for each of these scenarios so that the behavior is interpretable.
  4. Remove Special Quota

    • Remove Default Quota and System Quota, replace them with APIs and common capabilities, and automatically inject a QuotaName for Pods that do not specify one. The current design and implementation contain special logic for these two Quotas, and on the whole this special-casing is unreasonable. For example, the scheduler creates these two ElasticQuota objects at startup, and they require extra consideration when calculating fairness. Pods that do not declare an associated Quota use DefaultQuota by default, and dedicated code (such as migrateDefaultQuotaGroupsPod) exists to correct the status of DefaultQuota. From another perspective, these special cases can be expressed as APIs and offered as a general capability: for Pods that do not declare a QuotaName, we can inject a quota name into the Pod through the ClusterColocationProfile mechanism, so that the Quota it points to effectively serves as the DefaultQuota.
  5. Preemption strategy enhancement #1840 #1879

    • Support a job-granularity preemption mechanism to optimize resource allocation.
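
As a rough illustration of the quota-name injection described in item 4, the sketch below fills in a quota name only when a Pod has not declared one. It deliberately avoids the real ClusterColocationProfile types: `injectQuotaName` is a hypothetical helper, and both the label key and the fallback quota name are assumptions made for this example.

```go
package main

import "fmt"

const (
	// labelQuotaName is assumed here to be the label that associates a Pod
	// with an ElasticQuota; verify the key against the Koordinator API.
	labelQuotaName = "quota.scheduling.koordinator.sh/name"
	// fallbackQuota is a hypothetical name for an ordinary ElasticQuota
	// that plays the role of the former DefaultQuota.
	fallbackQuota = "cluster-default"
)

// injectQuotaName returns a copy of the Pod labels in which the quota name
// is set to fallback when the Pod did not declare one. An existing quota
// name is never overwritten, and the input map is left unmodified.
func injectQuotaName(labels map[string]string, fallback string) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	if out[labelQuotaName] == "" {
		out[labelQuotaName] = fallback
	}
	return out
}

func main() {
	undeclared := map[string]string{"app": "batch-job"}
	declared := map[string]string{labelQuotaName: "team-a"}

	fmt.Println(injectQuotaName(undeclared, fallbackQuota)[labelQuotaName])
	fmt.Println(injectQuotaName(declared, fallbackQuota)[labelQuotaName])
}
```

With injection handled this way, the fallback quota can be a regular ElasticQuota object and needs no special treatment in the scheduler.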

Code quality improvements

  • Improve the readability and maintainability of ElasticQuota's core code and other parts to support the continued healthy development of the project.

eahydra, Jan 13 '24 13:01