
Document operator workflows

Open jacobtomlinson opened this issue 2 years ago • 0 comments

The operator enables several workflows, and the steps in each are carried out by different actors. Some are carried out by the user (creating daskcluster resources). Some are done by the operator (creating pods/services). Some are recursive (the operator creates worker groups, then the operator creates pods for those worker groups). Others are done by Kubernetes itself (deleting a daskcluster causes Kubernetes to cascade delete the worker groups, pods, services, etc).

It would be good to document all of these workflows, something like this.

Installation

  • User installs the new Dask cluster and worker group resource types
  • User installs the operator daemon

Cluster creation

  • User creates cluster resource
  • Operator notices the cluster resource and creates the scheduler pod/service and a worker group resource
  • Operator notices the worker group resource and creates the worker pods
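
The recursive part of this workflow can be sketched as a toy in-memory model. This is only an illustration of the two-step reaction described above; every function and field name here is hypothetical and none of it is the operator's actual code.

```python
# Hypothetical in-memory model of the two-step creation workflow.
# None of these names come from dask-kubernetes itself.


def on_cluster_created(cluster):
    """Step 1: the operator reacts to a new cluster resource."""
    name = cluster["name"]
    return [
        {"kind": "Pod", "name": f"{name}-scheduler", "owner": name},
        {"kind": "Service", "name": f"{name}-scheduler", "owner": name},
        {
            "kind": "WorkerGroup",
            "name": f"{name}-default",
            "owner": name,
            "replicas": cluster["workers"],
        },
    ]


def on_worker_group_created(group):
    """Step 2: the operator reacts to the worker group it just created."""
    return [
        {"kind": "Pod", "name": f"{group['name']}-worker-{i}", "owner": group["name"]}
        for i in range(group["replicas"])
    ]


def reconcile(cluster):
    """Run both steps, feeding step 1's worker groups into step 2."""
    resources = on_cluster_created(cluster)
    for resource in list(resources):
        if resource["kind"] == "WorkerGroup":
            resources.extend(on_worker_group_created(resource))
    return resources


resources = reconcile({"name": "demo", "workers": 3})
for resource in resources:
    print(resource["kind"], resource["name"])
```

In the real operator the second step would be triggered by Kubernetes notifying the operator about the new worker group, not by a direct function call.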

Cluster scaling

  • User modifies cluster worker count
  • Operator notices the change and starts/stops pods to match
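
A minimal sketch of the "starts/stops pods to match" step, assuming the operator only needs to compare the desired worker count with the pods currently running. `plan_scaling` is a hypothetical helper, and the choice of which pods to stop is an assumption made for illustration.

```python
def plan_scaling(running_pods, desired):
    """Return (number_of_pods_to_start, list_of_pods_to_stop).

    running_pods is a list of pod names; desired is the worker count
    requested on the cluster resource.
    """
    if desired < 0:
        raise ValueError("desired worker count cannot be negative")
    current = len(running_pods)
    if desired > current:
        return desired - current, []
    # Assumption: stop the pods at the end of the list. A real operator
    # might prefer idle workers or ask the scheduler which to retire.
    return 0, running_pods[desired:]


print(plan_scaling(["w0", "w1", "w2"], 5))
print(plan_scaling(["w0", "w1", "w2"], 1))
```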

Cluster adaptive mode

  • User toggles adaptive setting on the cluster resource
  • Operator begins polling the scheduler for the desired number of workers and adjusts the worker count on the cluster resource to match (triggering the scaling workflow whenever the count changes)
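
One polling iteration of adaptive mode might look like the sketch below. `get_desired_workers` stands in for whatever call asks the scheduler for its desired worker count, and the `minimum`/`maximum` bounds are an assumption added for illustration; neither is taken from the operator.

```python
def adapt_once(get_desired_workers, cluster, minimum=0, maximum=10):
    """Poll once and update the cluster's worker count if it changed.

    Returns True when the count was modified, which is the point where
    the scaling workflow would be triggered.
    """
    desired = get_desired_workers()
    # Keep the requested count inside the (hypothetical) allowed range.
    clamped = max(minimum, min(maximum, desired))
    if clamped != cluster["workers"]:
        cluster["workers"] = clamped
        return True
    return False


cluster = {"workers": 2}
changed = adapt_once(lambda: 5, cluster)
print(changed, cluster["workers"])
```

An operator would call something like this on a timer for as long as the adaptive setting is enabled.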

Cluster deletion

  • User deletes cluster resource
  • Kubernetes cascade deletes all child resources including worker groups, pods and services
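
Cascade deletion relies on Kubernetes owner references: a dependent object is garbage collected once all of its owners are gone. The toy simulation below models that rule with plain dictionaries; the `uid` and `owners` fields are a simplified stand-in for `metadata.uid` and `metadata.ownerReferences`, and the function is hypothetical.

```python
def cascade_delete(resources, root_uid):
    """Return the resources that survive deleting root_uid.

    A resource with at least one owner is removed once every one of
    its owners has been removed. Resources without owners are only
    removed if they are the root itself.
    """
    doomed = {root_uid}
    changed = True
    while changed:
        changed = False
        for res in resources:
            owners = set(res["owners"])
            if res["uid"] not in doomed and owners and owners <= doomed:
                doomed.add(res["uid"])
                changed = True
    return [res for res in resources if res["uid"] not in doomed]


resources = [
    {"uid": "cluster", "owners": []},
    {"uid": "scheduler-pod", "owners": ["cluster"]},
    {"uid": "worker-group", "owners": ["cluster"]},
    {"uid": "worker-pod-0", "owners": ["worker-group"]},
    {"uid": "unrelated-pod", "owners": []},
]
survivors = cascade_delete(resources, "cluster")
print([res["uid"] for res in survivors])
```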

Create additional worker groups for heterogeneous clusters

  • User creates a new worker group resource (with resources that differ from the default, such as GPUs or high memory)
  • Operator notices the new worker group and creates pods
  • Operator makes the cluster resource the owner of the worker group resource so that it will also be cascade deleted
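
Adoption amounts to adding an entry to the worker group's `metadata.ownerReferences` that points at the cluster. The sketch below builds that entry from plain dictionaries. The `adopt` helper is hypothetical, and the `apiVersion` and `kind` values in the example are assumptions about how the resources are named.

```python
def adopt(child, owner):
    """Add an owner reference for `owner` to `child`, once only."""
    ref = {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
        "controller": True,
        "blockOwnerDeletion": True,
    }
    refs = child["metadata"].setdefault("ownerReferences", [])
    # Avoid duplicate references if the handler runs more than once.
    if not any(existing["uid"] == ref["uid"] for existing in refs):
        refs.append(ref)
    return child


# Assumed resource names, for illustration only.
cluster = {
    "apiVersion": "kubernetes.dask.org/v1",
    "kind": "DaskCluster",
    "metadata": {"name": "demo", "uid": "1234-abcd"},
}
gpu_group = {"metadata": {"name": "demo-gpu"}}
adopt(gpu_group, cluster)
print(gpu_group["metadata"]["ownerReferences"])
```

With the reference in place, deleting the cluster lets Kubernetes garbage collect the extra worker group along with everything else.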

This could also be a fun time to try out mermaid diagrams.

jacobtomlinson, May 05 '22 16:05