ArmoniK should allow for grid partitioning
An on-premises ArmoniK deployment should allow multiple applications to share the same infrastructure while preserving each application's reserved capacity. When all applications have jobs running, each should have access to its reserved capacity. When an application does not use all its cores, the idle cores should be used to process jobs from the other applications. If that application then submits a new job, the tasks running on its lent cores are given a time slot to finish. After this time slot, the cores corresponding to the reserved capacity are preempted.
Let AppA, AppB and AppC be three applications that share the same infrastructure. AppA has a reserved capacity of 500 cores, AppB 200 cores and AppC 300 cores. The preemption delay is set to an interval of 3 to 6 minutes. Each worker is configured to consume one core.
- T0: All applications are running. AppA uses 500 workers, AppB uses 200 workers and AppC uses 300 workers.
- T1: AppB has finished processing its job. Its workers become idle and each one randomly picks a task from another application. The available workers consequently process tasks from the other applications: AppA uses its 500 workers plus 94 AppB workers, while AppC uses its 300 workers plus 106 AppB workers. Every 3 minutes, the 200 AppB workers check whether new tasks from AppB are available.
- T2: AppB submits a new job of 150 tasks. All workers of this application are busy. Every 3 minutes, each AppB worker checks whether an AppB task is available; the second time it finds one, it stops the running AppA or AppC task and starts a new AppB task.
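The scenario above can be sketched in a few lines of Python. This is only an illustrative model, not ArmoniK code: the function names `lend_idle_workers` and `preemption_delay` are hypothetical, and the sketch assumes that every worker polls at exact multiples of the 3-minute interval. Under that assumption, preempting on the second successful check yields a delay between 3 and 6 minutes, which matches the preemption interval stated above.

```python
import math
import random
from collections import Counter

# Reserved capacities from the scenario (one worker per core).
RESERVED = {"AppA": 500, "AppB": 200, "AppC": 300}
POLL_INTERVAL_MIN = 3.0  # workers check for their own tasks every 3 minutes


def lend_idle_workers(idle_owner, reserved, rng):
    """T1: each idle worker of `idle_owner` randomly picks another application.

    Returns the number of lent workers per borrowing application.
    """
    borrowers = [app for app in reserved if app != idle_owner]
    picks = Counter(rng.choice(borrowers) for _ in range(reserved[idle_owner]))
    return {app: picks[app] for app in borrowers}


def preemption_delay(submit_time, poll_interval=POLL_INTERVAL_MIN):
    """T2: minutes between job submission and preemption of a lent worker.

    Assumption: the worker polls at multiples of `poll_interval` and preempts
    the foreign task on the second poll that sees a pending task of its owner.
    """
    first_seen = math.ceil(submit_time / poll_interval) * poll_interval
    second_seen = first_seen + poll_interval
    return second_seen - submit_time


if __name__ == "__main__":
    lent = lend_idle_workers("AppB", RESERVED, random.Random())
    print("AppB workers lent at T1:", lent)
    for t in (0.0, 1.0, 2.5):
        print(f"job submitted at t={t} min, preempted after {preemption_delay(t)} min")
```

Because the pick is random, the split between AppA and AppC varies from run to run; the 94/106 split in the scenario is one possible outcome.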
Tasks:
- [x] https://github.com/aneoconsulting/ArmoniK.Core/pull/114
- [x] #363
- [x] metrics exporter for multiple partitions https://github.com/aneoconsulting/ArmoniK.Core/pull/119
- [x] one queue per partition #432
- [x] one replica set per partition (static) #400
- [x] #436 HPA for multiple replica sets (no preemption)
- [x] #362
- [x] Partition constraint manager
- [x] #462
- [ ] PartitionId list in Properties (UnifiedAPI)
- [ ] Let user change TaskOptions at Submit
- [ ] Partition manager service #442
- [ ] Partition scaler #443
- [ ] Grid partition can be configured on an hourly basis #44
- [ ] Add documentation
Each application will have its own dedicated replica set. The replica sets will use the same pod definition but with different configurations. The distribution of workers across applications is controlled by setting the size of each replica set.
The sizes of the replica sets can be changed at any time with a simple kubectl command. A cron job can automate this.
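As a minimal sketch of that resizing step, the Python snippet below builds one `kubectl scale` command per partition from a size table. The replica set naming scheme (`compute-plane-<app>`), the `armonik` namespace and the helper name `scale_commands` are assumptions made for illustration; the sizes are the reserved capacities from the example above. The script only prints the commands, so it can be reviewed before anything is run against a cluster.

```python
import shlex

# Reserved capacities from the example (one worker pod per core).
RESERVED_CORES = {"AppA": 500, "AppB": 200, "AppC": 300}


def scale_commands(sizes, namespace="armonik", prefix="compute-plane-"):
    """Build one `kubectl scale` command per partition.

    `namespace` and `prefix` are hypothetical defaults; adapt them to the
    names actually used by the deployment.
    """
    commands = []
    for app, replicas in sizes.items():
        if replicas < 0:
            raise ValueError(f"negative replica count for {app}: {replicas}")
        name = f"{prefix}{app.lower()}"
        commands.append([
            "kubectl", "scale", f"replicaset/{name}",
            f"--replicas={replicas}",
            "--namespace", namespace,
        ])
    return commands


if __name__ == "__main__":
    # Print the commands; a cron job could pipe them to a shell or run them
    # with subprocess once the names have been checked.
    for cmd in scale_commands(RESERVED_CORES):
        print(shlex.join(cmd))
```

If the replica sets are owned by Deployments, the Deployment controller would revert a direct resize, so the same command would need to target `deployment/<name>` instead.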
https://hackmd.io/5lGsWI3WRw28OTOOGtbkHA