
Distributed Processing in any way?

Open LeaveNhA opened this issue 1 year ago • 5 comments

Hello, as you might know, I admire your work (all of you, all the contributors) and love this community.

With that said, here is my simple question:

Is there any plan to make it distributed, or is there an existing library/framework I can use for this purpose?

Thank you, and good luck!

LeaveNhA avatar Jan 25 '24 14:01 LeaveNhA

Could you say more about what you are looking for? Distributed is a pretty generic term.

What exactly would you like to distribute? Training / inference? At what granularity? Any specific example?

awni avatar Jan 25 '24 14:01 awni

Let's assume I have a couple of high-end Apple devices at home. I want to use them together to get more processing power, both for training and inference.

LeaveNhA avatar Jan 29 '24 16:01 LeaveNhA

We could be referring to solutions like Spark or Dask (with local, Kubernetes, MPI, and other backends) for distributed data processing, which could eventually support ML specifics such as Dask-ML, or to something ML-specific from the start, like the distributed training features in PyTorch and TensorFlow.

Implementation examples

Dask

Set up

The setup depends on the backend. In this example, we define a local cluster.

Start manager (scheduler):

dask scheduler

Start workers:

dask worker <dask-scheduler-address>

The cluster can also be created and managed using Python:

from dask.distributed import LocalCluster


# The cluster object allows scaling the number of workers on the fly
cluster = LocalCluster(n_workers=1)
cluster.scale(2)

The workers need to discover the manager (scheduler) on the network and share access to resources such as files and data sources.

Usage

Standalone:

import dask.dataframe as dd


df = dd.read_csv(...)
df.x.sum().compute()

Local cluster:

from dask.distributed import Client
import dask.dataframe as dd


client = Client('<dask-scheduler-address>')

# Creating the client registers it as the default, so compute() runs on the cluster
df = dd.read_csv(...)
df.x.sum().compute()

ML usage

Local cluster:

from dask.distributed import Client
import dask.dataframe as dd

from dask_ml.cluster import KMeans


client = Client('<dask-scheduler-address>')

# Creating the client registers it as the default, so compute() runs on the cluster
df = dd.read_csv(...)

kmeans = KMeans(n_clusters=3, init_max_iter=2, oversampling_factor=10)
kmeans.fit(df.to_dask_array(lengths=True))
kmeans.predict(df.to_dask_array(lengths=True)).compute()

We could implement an MLX collection backend (mlx-dask) for Dask: https://docs.dask.org/en/latest/how-to/selecting-the-collection-backend.html#defining-a-new-collection-backend
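
Until a native backend exists, MLX arrays can already be bridged into regular Dask collections by converting through NumPy. A rough sketch of that stopgap (illustrative only, not an mlx-dask implementation; only the dask.array, numpy, and mlx.core calls shown are real APIs):

import dask.array as da
import numpy as np
import mlx.core as mx


# Compute something with MLX, then hand it to Dask as a chunked array.
x = mx.arange(1_000, dtype=mx.float32)
dx = da.from_array(np.array(x), chunks=250)
print(dx.sum().compute())

This copies data through NumPy on a single machine; a real collection backend would avoid the conversion and let Dask build MLX-backed chunks directly.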

danilopeixoto avatar Jan 30 '24 12:01 danilopeixoto

Man, you've really lit up the path. Yes, we can. Give me a couple of days to read the documentation and implementation details.

I have a question in advance: how deep do we have to dive in order to implement such a backend? Can we find best practices, or can we get ideas from other backends (I assume there are other backends)?

Thank you, truly. Sincerely.

LeaveNhA avatar Jan 31 '24 12:01 LeaveNhA

@LeaveNhA, I encountered challenges while working on a prototype with the Dask Backend Entrypoint API:

  • The MLX data type is not an alias of np.dtype as expected by Dask (a quick check is sketched after this list).
  • There could be additional compatibility issues with the MLX Random module.
  • It's important to note that the Dask Backend Entrypoint API is still in an experimental phase.
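
For the first point, a quick check (nothing here beyond standard NumPy and MLX calls; the exact printed class name may vary between versions):

import numpy as np
import mlx.core as mx


x = mx.zeros((4,), dtype=mx.float32)
print(type(x.dtype))                            # an MLX Dtype, not numpy.dtype
print(isinstance(x.dtype, np.dtype))            # False
print(isinstance(np.zeros(4).dtype, np.dtype))  # True for NumPy arrays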

We could explore alternative methods such as Dask Custom Collection, Dask Delayed and Dask Futures to implement distributed computations.
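
For example, a minimal Dask Delayed sketch (illustrative only; partial_sum is a made-up helper, while dask.delayed and the mlx.core calls are real):

import dask
import mlx.core as mx


@dask.delayed
def partial_sum(chunk):
    # Each task materializes its chunk as an MLX array and reduces it on-device.
    return mx.sum(mx.array(chunk)).item()


chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
total = dask.delayed(sum)([partial_sum(c) for c in chunks])
print(total.compute())  # 21.0

With the default threaded scheduler this all runs in one process; shipping MLX arrays between remote workers would additionally need serialization support, which is part of what a proper backend or custom collection would have to solve.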

Ray is also an interesting option to explore.

danilopeixoto avatar Feb 14 '24 02:02 danilopeixoto

I think this problem got solved while we were busy, right?

LeaveNhA avatar Jul 25 '24 02:07 LeaveNhA

It seems the MLX team added MPI distributed training support!

danilopeixoto avatar Jul 25 '24 05:07 danilopeixoto

We did indeed: https://ml-explore.github.io/mlx/build/html/usage/distributed.html
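
The hello world there is along these lines (see the linked docs for setup and launch details; with the MPI backend it is run with something like mpirun -np 2 -- python script.py):

import mlx.core as mx


world = mx.distributed.init()
x = mx.distributed.all_sum(mx.ones(10))
print(world.rank(), x)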

I think we can close this and open more targeted issues related to distributed models as they come up.

awni avatar Jul 25 '24 13:07 awni