dask-kubernetes
Document operator workflows
The operator enables several workflows. Some steps are carried out by the user (creating daskcluster resources), some are done by the operator (creating pods/services), some are recursive (the operator creates worker groups, then the operator creates pods for those worker groups), and others are done by Kubernetes itself (deleting a daskcluster causes Kubernetes to cascade delete the worker groups, pods, services, etc.).
It would be good to document all of these workflows, something like this:
Installation
- User installs the new Dask cluster and worker group resource types
- User installs the operator daemon
Cluster creation
- User creates cluster resource
- Operator notices the cluster resource and creates the scheduler pod/service and a worker group resource
- Operator notices the worker group resource and creates the worker pods
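The two-step creation flow above could be sketched as a pair of handlers over an in-memory store. This is only an illustration: `Store`, `on_cluster_created`, and `on_worker_group_created` are hypothetical names rather than the operator's API, and the kind names and spec fields (`workers`, `replicas`) are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """In-memory stand-in for the Kubernetes API (hypothetical)."""
    objects: list = field(default_factory=list)

    def create(self, kind, name, **spec):
        obj = {"kind": kind, "name": name, "spec": spec}
        self.objects.append(obj)
        return obj

    def names(self, kind):
        return [o["name"] for o in self.objects if o["kind"] == kind]


def on_cluster_created(store, cluster):
    """Step 1: react to a new cluster resource."""
    name = cluster["name"]
    store.create("Pod", f"{name}-scheduler")
    store.create("Service", f"{name}-scheduler")
    # No worker pods are created here. The handler only creates a worker
    # group resource and leaves the pods to the next handler.
    return store.create(
        "DaskWorkerGroup",
        f"{name}-default",
        cluster=name,
        replicas=cluster["spec"]["workers"],
    )


def on_worker_group_created(store, group):
    """Step 2: react to a new worker group resource."""
    for i in range(group["spec"]["replicas"]):
        store.create("Pod", f"{group['name']}-worker-{i}")


store = Store()
cluster = store.create("DaskCluster", "demo", workers=2)
group = on_cluster_created(store, cluster)
on_worker_group_created(store, group)
print(store.names("Pod"))
```

Splitting the work this way means a worker group created by the user goes through exactly the same second step as the default one.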
Cluster scaling
- User modifies cluster worker count
- Operator notices the change and starts/stops pods to match
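The "starts/stops pods to match" step is essentially a diff between the desired worker count and the pods that currently exist. A minimal sketch, assuming a hypothetical `plan_scaling` helper and an arbitrary policy of stopping the last-listed pods first (how workers are actually chosen for removal is not specified here):

```python
def plan_scaling(desired, running):
    """Return (number_to_start, pod_names_to_stop) to reach `desired` workers."""
    if desired < 0:
        raise ValueError("desired worker count must be non-negative")
    if desired >= len(running):
        return desired - len(running), []
    # Assumption: keep the first `desired` pods and stop the rest.
    return 0, running[desired:]


scale_up = plan_scaling(4, ["w-0", "w-1"])
scale_down = plan_scaling(1, ["w-0", "w-1", "w-2"])
print(scale_up)
print(scale_down)
```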
Cluster adaptive mode
- User toggles adaptive setting on the cluster resource
- Operator begins polling the scheduler for the desired number of workers and adjusts the worker count on the cluster resource to match (triggering the scaling workflow when it changes)
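One polling step of the adaptive workflow might look like the sketch below. `adapt_once` and `poll_scheduler` are hypothetical names, the `minimum`/`maximum` bounds are an added assumption, and the scheduler is replaced by a stub that returns canned answers.

```python
def adapt_once(cluster, poll_scheduler, minimum=0, maximum=None):
    """Run one polling step; return True if the worker count was changed."""
    desired = max(minimum, poll_scheduler())
    if maximum is not None:
        desired = min(desired, maximum)
    if desired == cluster["spec"]["workers"]:
        return False
    # Writing the new count to the cluster resource is what would
    # trigger the ordinary scaling workflow.
    cluster["spec"]["workers"] = desired
    return True


cluster = {"name": "demo", "spec": {"workers": 2, "adaptive": True}}
answers = iter([2, 5, 50])  # stubbed scheduler responses
changes = [adapt_once(cluster, lambda: next(answers), maximum=10) for _ in range(3)]
print(changes, cluster["spec"]["workers"])
```

Because the step only edits the worker count on the cluster resource, adaptive mode needs no pod logic of its own.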
Cluster deletion
- User deletes cluster resource
- Kubernetes cascade deletes all child resources including worker groups, pods and services
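The cascade can be illustrated with a toy garbage collector that follows owner links. This is a simplified model, not Kubernetes' actual implementation: it assumes each object has at most one owner, and the `uid`/`owner` fields are made up for the example.

```python
def cascade_delete(objects, uid):
    """Remove the object `uid` and everything that transitively belongs to it."""
    doomed = {uid}
    changed = True
    while changed:
        changed = False
        for obj in objects:
            if obj["uid"] not in doomed and obj["owner"] in doomed:
                doomed.add(obj["uid"])
                changed = True
    return [o for o in objects if o["uid"] not in doomed]


objects = [
    {"uid": "c1", "kind": "DaskCluster", "owner": None},
    {"uid": "g1", "kind": "DaskWorkerGroup", "owner": "c1"},
    {"uid": "p1", "kind": "Pod", "owner": "g1"},
    {"uid": "s1", "kind": "Service", "owner": "c1"},
    {"uid": "x1", "kind": "Pod", "owner": None},  # unrelated pod
]
remaining = cascade_delete(objects, "c1")
print([o["uid"] for o in remaining])
```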
Create additional worker groups for heterogeneous clusters
- User creates a new worker group resource (with different resources from the default, such as GPUs or high memory)
- Operator notices the new worker group and creates pods
- Operator adopts the worker group resource into the cluster resource so that it will also be cascade deleted
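Adoption amounts to adding an owner reference to the worker group's metadata. The sketch below works on plain dictionaries: the `ownerReferences` field names are standard Kubernetes metadata, while the `adopt` helper, the `apiVersion` value, the UID, and the worker group spec are assumptions made for illustration.

```python
def adopt(child, owner):
    """Return a copy of `child` whose metadata lists `owner` as its owner."""
    ref = {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
        "controller": True,
        "blockOwnerDeletion": True,
    }
    metadata = dict(child.get("metadata", {}))
    # Drop any stale reference to the same owner so adoption is idempotent.
    refs = [r for r in metadata.get("ownerReferences", []) if r["uid"] != ref["uid"]]
    metadata["ownerReferences"] = refs + [ref]
    return {**child, "metadata": metadata}


cluster = {
    "apiVersion": "kubernetes.dask.org/v1",  # assumed group/version
    "kind": "DaskCluster",
    "metadata": {"name": "demo", "uid": "1234-abcd"},  # made-up UID
}
gpu_group = {
    "apiVersion": "kubernetes.dask.org/v1",
    "kind": "DaskWorkerGroup",
    "metadata": {"name": "demo-gpu"},
    "spec": {"cluster": "demo", "replicas": 2},  # hypothetical spec shape
}
adopted = adopt(adopt(gpu_group, cluster), cluster)
print(adopted["metadata"]["ownerReferences"])
```

Once the reference is in place, deleting the cluster lets Kubernetes remove the extra worker group along with the default one.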
This could also be a fun time to try out mermaid diagrams.