Implement shuffle service based on ray object store
Currently, RayDP runs Spark with Spark's built-in, standalone shuffle service. However, Ray can perform shuffling itself: https://github.com/ray-project/ray/blob/master/python/ray/experimental/shuffle.py
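For reference, here is a minimal sketch of what a shuffle on top of the Ray object store can look like, loosely following the pattern in the linked experimental script (map tasks write hash-partitioned buckets into the object store, reduce tasks pull them). The task names and data are purely illustrative; a real RayDP integration would plug into Spark's shuffle APIs rather than work on Python lists:

```python
import ray

ray.init()

@ray.remote
def map_task(partition, num_reducers):
    # Hash-partition one input partition into num_reducers buckets.
    # Each returned bucket is stored as a separate object in the Ray object store.
    buckets = [[] for _ in range(num_reducers)]
    for record in partition:
        buckets[hash(record) % num_reducers].append(record)
    return buckets

@ray.remote
def reduce_task(*buckets):
    # Merge the buckets destined for this reducer and sort them.
    merged = []
    for bucket in buckets:
        merged.extend(bucket)
    return sorted(merged)

partitions = [[5, 1, 9], [3, 7, 2], [8, 4, 6]]
num_reducers = 2

# Map phase: each map task produces one object-store bucket per reducer.
map_outputs = [
    map_task.options(num_returns=num_reducers).remote(p, num_reducers)
    for p in partitions
]

# Reduce phase: each reducer pulls its bucket from every map task.
reduce_outputs = [
    reduce_task.remote(*[m[r] for m in map_outputs])
    for r in range(num_reducers)
]

print(ray.get(reduce_outputs))
```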
It would be nice to integrate a Ray shuffle service with Spark inside RayDP. This would have the following advantages:
- Ray tracks the memory usage of the Spark application
- Spark actors can be killed at any time without losing persisted data => the number of Spark executors can be scaled dynamically according to the workload
- Possibly zero-copy `ray.data.from_spark()` / `ds.to_spark()` (see the sketch after this list)
- Possibly highly improved shuffling performance
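To make the zero-copy point concrete, this is roughly where the hand-off between Spark and Ray happens today, assuming the `raydp.init_spark`, `ray.data.from_spark`, and `Dataset.to_spark` APIs from recent RayDP/Ray releases. A Ray-backed shuffle and object-store integration is what could make these conversions cheap or copy-free:

```python
import ray
import raydp

ray.init()

# Start a Spark session whose executors run as Ray actors.
spark = raydp.init_spark(
    app_name="raydp_shuffle_demo",
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
)

df = spark.range(0, 1000)

# Today this conversion moves data between Spark and the Ray object store;
# a Ray-based shuffle/object-store integration could make it (near) zero-copy.
ds = ray.data.from_spark(df)
print(ds.count())

# And back again.
df2 = ds.to_spark(spark)
df2.show(5)

raydp.stop_spark()
```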
This might be a good idea! Thanks for the suggestion. One concern is that Ray Datasets use the Arrow format, while Spark DataFrames use Spark's own internal format. But we'll look into it.
As for zero-copy, that depends on the Ray serialization layer, which is still under development.
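To illustrate the format concern, a small sketch, assuming Ray's Arrow-aware serialization: Arrow data already round-trips through the object store with little or no copying of the underlying buffers, whereas Spark's internal rows would first have to be converted to Arrow (as Spark already does for `toPandas()` / pandas UDFs), and that conversion is the part that is not zero-copy today:

```python
import pyarrow as pa
import ray

ray.init()

# An Arrow table put into the Ray object store can typically be read back
# without copying the underlying buffers, since Arrow buffers are immutable.
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
ref = ray.put(table)
same_table = ray.get(ref)  # no per-row deserialization needed

# Spark's internal rows (UnsafeRow) are not Arrow, so a real integration
# would need to convert partitions to Arrow record batches on the JVM side
# before handing them to Ray.
print(same_table)
```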