fondant Distributed execution

Distributed execution

Open RobbeSneyders opened this issue 1 year ago • 7 comments

Fondant currently supports execution across cores on a single machine, which enables workflows using millions of images, but becomes a bottleneck when scaling to tens or hundreds of millions of images.

This issue aims to describe how we can move to execution distributed across multiple machines.

The orchestrators we currently support, or are looking to support in the near future, share the following defining features:

They orchestrate docker containers. Data is read and written in each container, and only data locations are passed between containers.
Each container is executed on a single machine. The machine to execute on can be chosen per component, but distributed execution of a component across a cluster of machines is not supported out of the box.

Based on this description, we should implement distributed execution as a component that launches a distributed job "somewhere else" and monitors its progress.

This "somewhere else" can be different places:

On the Kubernetes cluster where the orchestrator is running by starting multiple kubernetes jobs from the component. (Eg. see these KfP components to start distributed training jobs with different frameworks)
On a remote cluster (eg. Dask, Spark) by connecting to an http endpoint and submitting a distributed job.

It seems like the managed orchestrator provide managed versions of these options as well:

Jobs
- Vertex AI custom jobs
- Sagemaker distributed feature engineering
Remote cluster
- Vertex AI serverless Spark components
- Sagemaker Spark processing container

To decide how to move forward, I think we need to answer the following questions (among others):

Which of these options allow for a single component implementation that can be executed on different orchestrators (so the component is actually reusable)?
Can we lean on existing frameworks for this? Eg. it seems to be possible to run distributed Dask from our orchestrators:
How much additional infrastructure setup is needed for the different approaches?

Due to the complexity of this topic, we probably need to create a PoC to understand the possibilities and constraints of the different orchestrators.

### Tasks
- [ ] https://github.com/ml6team/fondant/issues/571

Oct 24 '23 14:10 RobbeSneyders

fondant fondant copied to clipboard

Distributed execution

fondant
fondant copied to clipboard