Implement shuffle service based on ray object store
Currently, RayDP runs Spark with Spark's built-in, standalone shuffle service. However, Ray can perform shuffling itself: https://github.com/ray-project/ray/blob/master/python/ray/experimental/shuffle.py
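For reference, here is a minimal sketch of what a shuffle on top of the Ray object store can look like, loosely following the pattern in the linked experimental script (map tasks write hash-partitioned buckets into the object store, reduce tasks pull them). The task names and data are purely illustrative; a real RayDP integration would plug into Spark's shuffle APIs rather than work on Python lists:

```python
import ray

ray.init()

@ray.remote
def map_task(partition, num_reducers):
    # Hash-partition one input partition into num_reducers buckets.
    # Each returned bucket is stored as a separate object in the Ray object store.
    buckets = [[] for _ in range(num_reducers)]
    for record in partition:
        buckets[hash(record) % num_reducers].append(record)
    return buckets

@ray.remote
def reduce_task(*buckets):
    # Merge the buckets destined for this reducer and sort them.
    merged = []
    for bucket in buckets:
        merged.extend(bucket)
    return sorted(merged)

partitions = [[5, 1, 9], [3, 7, 2], [8, 4, 6]]
num_reducers = 2

# Map phase: each map task produces one object-store bucket per reducer.
map_outputs = [
    map_task.options(num_returns=num_reducers).remote(p, num_reducers)
    for p in partitions
]

# Reduce phase: each reducer pulls its bucket from every map task.
reduce_outputs = [
    reduce_task.remote(*[m[r] for m in map_outputs])
    for r in range(num_reducers)
]

print(ray.get(reduce_outputs))
```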
It would be nice to integrate a Ray shuffle service with Spark inside RayDP. This would have the following advantages:
- Ray tracks the memory usage of the Spark application
- Spark actors can be killed at any time without losing persisted data => the number of Spark executors can be scaled dynamically according to the workload
- Possibly zero-copy `ray.data.from_spark()` / `ds.to_spark()` (see the sketch after this list)
- Possibly highly improved shuffling performance
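To make the zero-copy point concrete, this is roughly where the hand-off between Spark and Ray happens today, assuming the `raydp.init_spark`, `ray.data.from_spark`, and `Dataset.to_spark` APIs from recent RayDP/Ray releases. A Ray-backed shuffle and object-store integration is what could make these conversions cheap or copy-free:

```python
import ray
import raydp

ray.init()

# Start a Spark session whose executors run as Ray actors.
spark = raydp.init_spark(
    app_name="raydp_shuffle_demo",
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
)

df = spark.range(0, 1000)

# Today this conversion moves data between Spark and the Ray object store;
# a Ray-based shuffle/object-store integration could make it (near) zero-copy.
ds = ray.data.from_spark(df)
print(ds.count())

# And back again.
df2 = ds.to_spark(spark)
df2.show(5)

raydp.stop_spark()
```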
This might be a good idea! Thanks for the suggestion. One concern is that Ray Datasets use the Arrow format, while Spark DataFrames use Spark's own internal format. But we'll look into it.
As for zero-copy, that depends on the Ray serialization layer, which is still under development.
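To illustrate the format concern, a small sketch, assuming Ray's Arrow-aware serialization: Arrow data already round-trips through the object store with little or no copying of the underlying buffers, whereas Spark's internal rows would first have to be converted to Arrow (as Spark already does for `toPandas()` / pandas UDFs), and that conversion is the part that is not zero-copy today:

```python
import pyarrow as pa
import ray

ray.init()

# An Arrow table put into the Ray object store can typically be read back
# without copying the underlying buffers, since Arrow buffers are immutable.
table = pa.table({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})
ref = ray.put(table)
same_table = ray.get(ref)  # no per-row deserialization needed

# Spark's internal rows (UnsafeRow) are not Arrow, so a real integration
# would need to convert partitions to Arrow record batches on the JVM side
# before handing them to Ray.
print(same_table)
```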