raydp icon indicating copy to clipboard operation
raydp copied to clipboard

Implement shuffle service based on ray object store

Open Hoeze opened this issue 2 years ago • 1 comments

Currently, raydp with spark uses the standalone, built-in shuffle service of spark. However, Ray can itself perform shuffling: https://github.com/ray-project/ray/blob/master/python/ray/experimental/shuffle.py

It would be nice to integrate a ray shuffle service with Spark inside RayDP. This would have the following advantages:

  • Ray tracks memory usage of the spark application
  • Spark actors can be killed any time without loosing persisted data => Number of Spark executors can be dynamically scaled according to the task
  • Possibly zero-copy ray.data.from_spark() / ds.to_spark()
  • Possibly highly improved shuffling performance

Hoeze avatar Oct 04 '22 22:10 Hoeze

This might be a good idea! Thanks for your advice. I have a concern that ray dataset use arrow format while spark dataframe use its own format, though. But we'll look into it.

For zerocopy, it has something to do with the ray serialization layer, which is still under development.

kira-lin avatar Oct 08 '22 02:10 kira-lin