
Launch tasks as k8s jobs in Kubernetes environments.

churromorales opened this issue 2 years ago

Update

We patched some of our Druid clusters to use k8s jobs instead of having the MiddleManager (MM) launch tasks. We are testing internally and I will post a PR soon. We decided to get rid of the MM altogether; the Overlord takes over the MM's work.

We created an OverlordK8sExtension which launches tasks directly from the Overlord itself. We did need to make some slight changes to druid-core: directory creation can optionally be done in the task itself, parameters were added to the CliPeon, and, optionally, the task itself can push reports and task logs on completion. Tasks survive restarts because they are decoupled from the other Druid services: on restart or shutdown of Druid, the tasks keep running as k8s jobs until they finish, so by definition all tasks are now restorable.
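To make the Update concrete, here is a minimal sketch of what launching a task as a k8s Job could look like, assuming the fabric8 kubernetes-client; the class name, image, namespace, and peon arguments are illustrative assumptions, not the actual extension code:

```java
// Sketch only: wraps a Druid peon task in a Kubernetes Job so the k8s
// scheduler owns its lifecycle. Names, image, and args are assumptions.
import io.fabric8.kubernetes.api.model.batch.v1.Job;
import io.fabric8.kubernetes.api.model.batch.v1.JobBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class K8sTaskLauncher {
  public Job launchTask(String taskId, String taskJsonBase64) {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      Job job = new JobBuilder()
          .withNewMetadata()
            .withName(taskId.toLowerCase())          // Job names must be DNS-safe
            .addToLabels("druid.task.id", taskId)
          .endMetadata()
          .withNewSpec()
            .withBackoffLimit(0)                     // do not retry a failed peon
            .withTtlSecondsAfterFinished(3600)       // let k8s garbage-collect the Job
            .withNewTemplate()
              .withNewSpec()
                .withRestartPolicy("Never")
                .addNewContainer()
                  .withName("peon")
                  .withImage("apache/druid:latest")  // assumed image
                  .withArgs("internal", "peon", taskId)
                  // pass the task spec via env instead of a shared filesystem
                  .addNewEnv().withName("TASK_JSON").withValue(taskJsonBase64).endEnv()
                .endContainer()
              .endSpec()
            .endTemplate()
          .endSpec()
          .build();
      return client.batch().v1().jobs().inNamespace("druid").resource(job).create();
    }
  }
}
```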

Motivation

Right now we need to know the number of task slots to allocate in advance in order to configure middle managers appropriately. We could leverage the k8s scheduler instead, making configuration easier for customers. We could still enforce an upper limit on concurrent tasks, but the system would become more elastic: when task slots are not in use, the Kubernetes resources originally allocated to the middle manager would be freed up.

Proposed changes

We need to reduce the peon tasks' dependency on the middle manager's shared filesystem. It is difficult to share files dynamically between a Kubernetes container and the application that launches it. ConfigMaps exist for this purpose, but there is no lifecycle management that deletes a ConfigMap after its job terminates, many k8s environments place a quota on the number of ConfigMaps you can create, and ConfigMaps are limited in size (1 MiB).

We need to resolve the following:

  1. The TaskLogPusher currently pushes the reports.json file from the middle manager; this needs to move into the task itself, since a task running in a k8s job can no longer call back to the middle manager.
  2. Saving tasks (only happens in k8s mode; see the sketch after this list):
    1. Whenever we create intermediate persist segment files, we should also push them to deep storage, into a directory specific to the task itself.
    2. When the middle manager writes a restore.json file, that should also be pushed to deep storage.
  3. Restoring tasks (only happens in k8s mode):
    1. When a task starts up, the first thing it does is pull the intermediate persist files and the restore.json file down to a local volume in its own task dir. That way it behaves the same as it would in non-k8s mode.
  4. If the task itself can push segments, I don’t think it is unreasonable for it to push other relevant data.
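A rough sketch of the save/restore flow above, under stated assumptions: DeepStorage here is a hypothetical stand-in interface for whatever deep-storage client (S3, HDFS, etc.) the task is configured with, not an existing Druid API.

```java
// Sketch of mirroring task state (intermediate persists, restore.json) to a
// task-specific directory in deep storage, and pulling it back on startup.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

interface DeepStorage {                      // hypothetical stand-in interface
  void push(Path localFile, String remoteKey) throws IOException;
  void pull(String remoteKey, Path localFile) throws IOException;
  Stream<String> list(String remotePrefix) throws IOException;
}

class TaskStateStore {
  private final DeepStorage storage;
  private final String taskId;

  TaskStateStore(DeepStorage storage, String taskId) {
    this.storage = storage;
    this.taskId = taskId;
  }

  // On each intermediate persist (or when restore.json is written), mirror
  // the file into a directory in deep storage keyed by the task id.
  void save(Path localFile) throws IOException {
    storage.push(localFile, taskId + "/" + localFile.getFileName());
  }

  // On startup in k8s mode, pull everything saved for this task into the
  // local task dir, so the task behaves as it would in non-k8s mode.
  void restore(Path taskDir) throws IOException {
    Files.createDirectories(taskDir);
    try (Stream<String> keys = storage.list(taskId + "/")) {
      for (String key : (Iterable<String>) keys::iterator) {
        storage.pull(key, taskDir.resolve(key.substring(taskId.length() + 1)));
      }
    }
  }
}
```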

Rationale

A patch has already been proposed by someone in the community: https://github.com/apache/druid/pull/10910. I reviewed that patch and I don’t believe it handles any sort of checkpointing; I also believe it ignores task reports. Keeping tasks as k8s jobs instead of bare pods will also let the k8s scheduler manage their lifecycle better: with the pod-based approach, if the middle manager unexpectedly dies, those pods are not cleaned up. While we can reuse some of that work, I think it is missing some key features of the way Druid currently works.

Operational impact

There should be none; there will be one or two configuration options to opt in to launching tasks as k8s jobs.
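For example, the opt-in might look something like this in the Overlord's runtime.properties (these property names are placeholders I am using for illustration, not final):

```properties
# Hypothetical property names, for illustration only.
# Launch tasks as k8s jobs instead of forking peons from a middle manager:
druid.indexer.runner.type=k8s
# Namespace to create the jobs in:
druid.indexer.runner.namespace=druid
```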

Future work

Remove the middle manager dependency completely. Right now the filesystem is tightly coupled between the middle manager and its tasks; once we can successfully launch tasks as k8s jobs from a middle manager, we can work on removing the middle manager altogether in a subsequent PR.

churromorales, Aug 11 '22 23:08