fondant icon indicating copy to clipboard operation
fondant copied to clipboard

Distributed execution

Open RobbeSneyders opened this issue 1 year ago • 7 comments

Fondant currently supports execution across cores on a single machine, which enables workflows using millions of images, but becomes a bottleneck when scaling to tens or hundreds of millions of images.

This issue aims to describe how we can move to execution distributed across multiple machines.


The orchestrators we currently support, or are looking to support in the near future, share the following defining features:

  • They orchestrate docker containers. Data is read and written in each container, and only data locations are passed between containers.
  • Each container is executed on a single machine. The machine to execute on can be chosen per component, but distributed execution of a component across a cluster of machines is not supported out of the box.

Based on this description, we should implement distributed execution as a component that launches a distributed job "somewhere else" and monitors its progress.

This "somewhere else" can be different places:

  • On the Kubernetes cluster where the orchestrator is running by starting multiple kubernetes jobs from the component. (Eg. see these KfP components to start distributed training jobs with different frameworks)
  • On a remote cluster (eg. Dask, Spark) by connecting to an http endpoint and submitting a distributed job.

It seems like the managed orchestrator provide managed versions of these options as well:


To decide how to move forward, I think we need to answer the following questions (among others):


Due to the complexity of this topic, we probably need to create a PoC to understand the possibilities and constraints of the different orchestrators.

### Tasks
- [ ] https://github.com/ml6team/fondant/issues/571

RobbeSneyders avatar Oct 24 '23 14:10 RobbeSneyders