Document how users can use Ray with Kedro
Potential user here. I'm interested in using Kedro, but we use Ray Distributed instead of PySpark for our execution engine. Do your pipelines support this?
Hi, thank you for your interest in Kedro! We don't natively support a Ray DataSet or a RayRunner so far, but you can add both as a custom dataset and a custom runner. See https://kedro.readthedocs.io/en/latest/06_nodes_and_pipelines/02_pipelines.html?#using-a-custom-runner for using a custom runner, and https://kedro.readthedocs.io/en/latest/07_extend_kedro/01_custom_datasets.html for custom datasets.
Update: I confirmed that Ray works fine :)
Just a quick note on this. @crypdick followed this tutorial from @dataengineerone, except that instead of using multiprocessing inside each node he used ray.
Thanks for investigating this @crypdick ✨
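For readers following along, the approach described above (calling Ray from inside a node instead of multiprocessing) could look roughly like the sketch below. It is not the code from the tutorial: `split_into_chunks`, `square_chunk` and `square_all` are hypothetical names, the squaring is a stand-in for real work, and it assumes Ray is installed wherever the node runs. The node stays an ordinary function, so Kedro does not need to know about Ray at all.

```python
from typing import List, Sequence


def split_into_chunks(items: Sequence[int], n_chunks: int) -> List[List[int]]:
    """Split items into at most n_chunks contiguous, roughly equal chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def square_chunk(chunk: List[int]) -> List[int]:
    """Hypothetical per-chunk work; stands in for the real computation."""
    return [x * x for x in chunk]


def square_all(items: Sequence[int], n_chunks: int = 4) -> List[int]:
    """Node function: a plain callable that fans work out to Ray internally."""
    import ray  # imported lazily so this module imports without Ray installed

    ray.init(ignore_reinit_error=True)  # no-op if Ray is already initialised
    remote_square = ray.remote(square_chunk)
    futures = [remote_square.remote(c) for c in split_into_chunks(items, n_chunks)]
    # ray.get on a list of object references returns results in the same order
    return [y for part in ray.get(futures) for y in part]
```

`square_all` could then be registered like any other node function in a pipeline; only the body of the node depends on Ray.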
Can we add official support for Ray?
We happen to have documentation about Kedro on Dask https://kedro.readthedocs.io/en/stable/deployment/dask.html and apparently people have made Kedro on Ray work. Maybe we could reopen this issue and turn it into a documentation one? Happy to work on this in 6 to 8 weeks.
@crypdick and @astrojuanlu
Is it possible to just add the @ray.remote decorator to my nodes and have everything work without any other changes? For Ray>1.5 we don't have to explicitly call ray.init().
@yetudada The tutorial is very old and not a very scalable way of doing this.
I think we should be able to do this with Hooks and some other Kedro features. I am a beginner with both Ray and Kedro, so any help or guidance on an approach to solving this would be appreciated.
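On the @ray.remote question above: a function decorated with @ray.remote can no longer be called directly, only through `.remote(...)`, which returns an object reference instead of the result. Kedro calls node functions as ordinary callables, so decorating a node directly is unlikely to work unchanged. One possible workaround, sketched here under that assumption, is a small wrapper that submits the function to Ray and blocks on the result, so the node still looks like a normal function. `run_on_ray` and `add_one` are hypothetical names, not part of Kedro or Ray.

```python
import functools
from typing import Any, Callable


def run_on_ray(func: Callable[..., Any]) -> Callable[..., Any]:
    """Hypothetical helper: run func as a single Ray task, keep a plain-call interface."""

    @functools.wraps(func)  # keeps the original name, which is handy for node naming
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        import ray  # imported lazily; assumes Ray is installed at run time

        if not ray.is_initialized():
            ray.init()
        remote_func = ray.remote(func)
        # Submit the task, then block until its result is available
        return ray.get(remote_func.remote(*args, **kwargs))

    return wrapper


@run_on_ray
def add_one(x: int) -> int:
    """Hypothetical node function."""
    return x + 1
```

Because the wrapper blocks on each task, this moves execution onto the cluster but does not add parallelism by itself. Parallelism across nodes would have to come from the runner or from fanning out inside the node.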
I am not sure, @Harsh-Maheshwari. I stopped using Kedro years ago in favor of Metaflow.
Reopening this as a documentation issue.
We were asked about this again this week https://linen-slack.kedro.org/t/15736818/is-there-any-docs-that-explains-how-kedro-can-be-integrated-#70729160-84c6-4750-a59f-b3571e2e026b
Maybe time to write this up.
Yesterday I met @IvanNardini and we discussed that it would be nice to bring this back to life at some point 😃
@astrojuanlu @stichbury Ray would be a great addition to Kedro. I moved to a purely Ray code base because of difficulties in executing Kedro with Ray. Documentation exploring the integration between the two would help us a lot.
Btw here's an early prototype from a hackathon https://github.com/kedro-org/kedro/pull/995
@Harsh-Maheshwari Could you share what difficulties you had using Ray with Kedro? Which part of Ray are you using?
Hi @noklam, sorry for the delayed response.
I am using Ray mostly for distributed computing. In the context of Kedro, we should be able to run a node across multiple workers, and Kedro shouldn't have to manage the scheduling or the cluster side of things.
Let's say I have a partitioned dataset with 10_000 parquet files. I should be able to start a remote Ray cluster, connect my local Kedro project to that cluster, and schedule a pipeline to run on each worker. Each worker runs the same pipeline but on a different batch of parquet files, and the results are stored in a new partitioned dataset according to my catalog. All of the scheduling and work distribution should be managed by the Ray head node.
A good-to-have feature would be the ability to restart from where we left off if the system fails for any reason.
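The batching and restart parts of this use case can be sketched without committing to any particular Kedro or Ray integration. The helpers below are hypothetical (`pending_partitions` and `make_batches` are not Kedro or Ray functions). They assume the partition IDs are available as strings, for example as the keys of the dictionary that a Kedro partitioned dataset returns on load, and that a partition counts as done once it exists in the output dataset.

```python
from typing import Iterable, List


def pending_partitions(input_ids: Iterable[str], done_ids: Iterable[str]) -> List[str]:
    """Return the input partitions that have no corresponding output yet."""
    done = set(done_ids)
    return sorted(pid for pid in input_ids if pid not in done)


def make_batches(ids: List[str], batch_size: int) -> List[List[str]]:
    """Group partition IDs into batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]


# Illustrative data only: 10 input partitions, 4 of which were already processed
all_ids = [f"part_{i:05d}" for i in range(10)]
already_done = all_ids[:4]
batches = make_batches(pending_partitions(all_ids, already_done), batch_size=4)
```

Each batch could then be submitted as one Ray task, leaving placement and scheduling to the Ray cluster. Rerunning after a failure would skip every partition that was already written.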
@Harsh-Maheshwari
> I am using Ray mostly for distributed computing, In the context of Kedro, We should be able to run a node across various workers
https://github.com/kedro-org/kedro/issues/479#issuecomment-674116048 Why does it fail to solve your problem?
Is this a problem specific to PartitionedDataset instead? https://github.com/kedro-org/kedro/issues/1413
I can see that the approach you suggest would work, but it's not clear to me why it is better.
@noklam
I have just described the use case. Right now I am not sure how to integrate Ray with Kedro, so I don't know if this is the best or only way to do this.
What I can say is that if the integration between Kedro and Ray were a bit more native, we could, for example, use different Ray autoscalers for different Kedro nodes.
@Harsh-Maheshwari I am no expert on Ray, so I need some examples to understand what's not working and how Kedro can make this easier.
Maybe the problem is that we just need a kedro-ray plugin and nothing should change in Kedro. I will leave it to someone more experienced with Ray.
In https://github.com/astrojuanlu/workshop-from-zero-to-mlops I described how to execute Kedro pipelines in Ray using Prefect.
An alternative method would be creating a custom runner, but maybe it's good to leave that to an orchestrator instead?
Just adding a comment to A) follow and B) mention that we may look into this in the coming months, mostly because doing batch embeddings in Spark with modern embedding models from Hugging Face is a pain.