kedro Document how users can use Ray with Kedro

Potential user here. I'm interested in using Kedro, but we use Ray Distributed instead of PySpark for our execution engine. Do your pipelines support this?

Aug 07 '20 13:08 crypdick

Hi, thank you for your interest in Kedro! We don't natively support a Ray DataSet or a RayRunner so far, but you can add both as a custom dataset and a custom runner. See https://kedro.readthedocs.io/en/latest/06_nodes_and_pipelines/02_pipelines.html?#using-a-custom-runner for using a custom runner, and https://kedro.readthedocs.io/en/latest/07_extend_kedro/01_custom_datasets.html for custom datasets.

Aug 07 '20 13:08 921kiyo

Update: I confirmed that Ray works fine :)

Aug 09 '20 17:08 crypdick

Just a quick note on this. @crypdick followed this tutorial from @dataengineerone, except instead of using multiprocessing inside each node he used ray.
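For readers wondering what "using ray inside a node" can look like, here is a minimal sketch. It is not the code from the tutorial or from @crypdick's project, and the names `square`, `fan_out`, and `squares_node` are hypothetical. The node stays a plain Python function and the Ray calls (`ray.remote`, `.remote()`, `ray.get`) live inside it. If Ray is not installed, the sketch falls back to a local loop so it still runs.

```python
from typing import Callable, List

try:
    import ray  # optional: without Ray the sketch runs everything locally
except ImportError:
    ray = None


def square(x: int) -> int:
    """Hypothetical unit of work applied to each item."""
    return x * x


def fan_out(func: Callable[[int], int], items: List[int]) -> List[int]:
    """Apply func to every item, as Ray tasks when Ray is available."""
    if ray is None:
        # Fallback path: plain sequential calls.
        return [func(item) for item in items]
    if not ray.is_initialized():
        # Starts a local Ray instance; connecting to a remote cluster
        # would need an address, which is outside this sketch.
        ray.init(ignore_reinit_error=True)
    remote_func = ray.remote(func)  # wrap the plain function as a Ray task
    refs = [remote_func.remote(item) for item in items]  # schedule the tasks
    return ray.get(refs)  # block until all results are back, in order


def squares_node(numbers: List[int]) -> List[int]:
    """Plain function with a normal signature, suitable for wrapping in a node."""
    return fan_out(square, numbers)
```

Because `squares_node` keeps a normal signature, it could be registered with something like `node(squares_node, inputs="numbers", outputs="squares")` (dataset names are made up) and called by an ordinary runner. Decorating the node function itself with `@ray.remote` is a different thing: a remote function has to be invoked through `.remote()` and returns an object reference, so it cannot simply be called like a regular function.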

Thanks for investigating this @crypdick ✨

Aug 14 '20 14:08 yetudada

Can we add official support for Ray?

Feb 08 '23 08:02 Harsh-Maheshwari

We happen to have documentation about Kedro on Dask https://kedro.readthedocs.io/en/stable/deployment/dask.html and apparently people have made Kedro on Ray work. Maybe we could reopen this issue and turn it into a documentation one? Happy to work on this in 6 to 8 weeks.

Feb 08 '23 10:02 astrojuanlu

@crypdick and @astrojuanlu Is it possible that I just add the @ray.remote decorator to my nodes and everything works without any other changes? For Ray>1.5 we don't have to call ray.init() explicitly.

@yetudada The tutorial is very old and not a very scalable way of doing this.

I think we should be able to do this with Hooks and some other Kedro features. I am a beginner in both Ray and Kedro, so any help or guidance on an approach to solving this is appreciated.

Apr 01 '23 20:04 Harsh-Maheshwari

I am not sure @Harsh-Maheshwari , I stopped using Kedro years ago in favor of Metaflow.

Apr 01 '23 20:04 crypdick

Reopening this as a documentation issue.

Sep 18 '23 16:09 astrojuanlu

We were asked about this again this week https://linen-slack.kedro.org/t/15736818/is-there-any-docs-that-explains-how-kedro-can-be-integrated-#70729160-84c6-4750-a59f-b3571e2e026b

Maybe time to write this up.

Sep 29 '23 09:09 stichbury

Yesterday I met @IvanNardini and we discussed that it would be nice to bring this back to life at some point 😃

Jan 25 '24 09:01 astrojuanlu

@astrojuanlu @stichbury Ray would be a great addition to Kedro. I moved to a purely Ray code base because of difficulties in executing Kedro with Ray. Documentation exploring the integration between the two would help us a lot.

Jan 25 '24 13:01 Harsh-Maheshwari

Btw here's an early prototype from a hackathon https://github.com/kedro-org/kedro/pull/995

Feb 09 '24 09:02 astrojuanlu

@Harsh-Maheshwari Could you share what difficulties you had using Ray with Kedro? Which part of Ray are you using?

Feb 12 '24 14:02 noklam

Hi @noklam , Sorry for the delayed response

I am using Ray mostly for distributed computing. In the context of Kedro, we should be able to run a node across various workers. Kedro shouldn't have to manage the scheduling or the cluster side of things.

Let's say I have a partitioned dataset with 10_000 parquet files. I should be able to start a remote Ray cluster, connect my local Kedro project to that cluster, and schedule a pipeline to run on each worker. Each worker runs the same pipeline but on different batches of parquet files, and the results are stored in a new partitioned dataset according to my catalog. All of the scheduling and work distribution should be managed by the Ray head node.
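To make the batching part of this use case concrete, here is a rough sketch under stated assumptions, not a working integration. It assumes the node receives a dictionary mapping partition ids to load callables, which is how Kedro's partitioned dataset hands data to a node on load, and that it returns a dictionary of partition id to result. `transform`, `process_batch`, and `process_partitions` are hypothetical names, and `transform` is a placeholder. Whether real load callables can be shipped to Ray workers depends on them being serializable, which is not verified here. Without Ray installed, the batches run locally.

```python
from typing import Any, Callable, Dict, Iterator, List

try:
    import ray  # optional: without Ray the batches are processed locally
except ImportError:
    ray = None

Loader = Callable[[], Any]  # a zero-argument callable that loads one partition


def transform(rows: List[int]) -> int:
    """Placeholder for the real per-partition computation."""
    return sum(rows)


def process_batch(batch: Dict[str, Loader]) -> Dict[str, Any]:
    """Load and transform every partition in one batch."""
    return {pid: transform(load()) for pid, load in batch.items()}


def chunked(keys: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` partition ids."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def process_partitions(
    partitions: Dict[str, Loader], batch_size: int = 100
) -> Dict[str, Any]:
    """Split partitions into batches and process one batch per task."""
    batches = [
        {pid: partitions[pid] for pid in ids}
        for ids in chunked(sorted(partitions), batch_size)
    ]
    if ray is None:
        results = [process_batch(batch) for batch in batches]
    else:
        if not ray.is_initialized():
            ray.init(ignore_reinit_error=True)  # local instance for the sketch
        remote_batch = ray.remote(process_batch)
        results = ray.get([remote_batch.remote(batch) for batch in batches])
    merged: Dict[str, Any] = {}
    for result in results:
        merged.update(result)  # one entry per partition id
    return merged
```

This distributes a single node's work across batches rather than running the whole pipeline on each worker, and it does not cover cluster connection details or restarting after a failure.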

A good-to-have feature would be: if the system fails for any reason, restart from where we left off.

Mar 21 '24 06:03 Harsh-Maheshwari

@Harsh-Maheshwari

> I am using Ray mostly for distributed computing, In the context of Kedro, We should be able to run a node across various workers

Why does https://github.com/kedro-org/kedro/issues/479#issuecomment-674116048 fail to solve your problem?

Is this a problem specific to PartitionedDataset instead? https://github.com/kedro-org/kedro/issues/1413

I can see that the approach you suggest would work, but it's not clear to me why it is better.

Mar 28 '24 12:03 noklam

@noklam

I have just described the use case. Right now I am not sure how to integrate Ray with Kedro,

so I don't know if this is the best or only way to do this.

What I can say is that if the integration between Kedro and Ray were a bit more native, we could, for example, use different Ray autoscalers for different Kedro nodes.

Mar 28 '24 12:03 Harsh-Maheshwari

@Harsh-Maheshwari I am no expert in Ray, so I need an example to understand what's not working and how Kedro can make this easier.

Maybe the problem is that we just need a kedro-ray plugin and nothing needs to change in Kedro. I will leave it to someone more experienced with Ray.

Mar 28 '24 12:03 noklam

In https://github.com/astrojuanlu/workshop-from-zero-to-mlops I described how to execute Kedro pipelines in Ray using Prefect.

An alternative method would be creating a custom runner, but maybe it's good to leave that to an orchestrator instead?
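For the custom-runner idea, the core scheduling logic could be sketched as below. This is an illustration that deliberately avoids Kedro's runner API: a real runner would subclass Kedro's `AbstractRunner`, whose method signatures differ between versions and are not reproduced here. `Step` and `run_steps` are hypothetical stand-ins for a Kedro node and a runner's main loop, and the steps are assumed to be in topological order already. The Ray path relies on the fact that an object reference passed as an argument to a remote call is resolved by Ray before the task runs, so Ray tracks the dependencies. Without Ray installed, the steps run sequentially.

```python
from typing import Any, Callable, Dict, List, NamedTuple

try:
    import ray  # optional: without Ray the steps run one after another
except ImportError:
    ray = None


class Step(NamedTuple):
    """Hypothetical stand-in for a node: a function plus dataset names."""
    func: Callable[..., Any]
    inputs: List[str]
    output: str


def run_steps(steps: List[Step], data: Dict[str, Any]) -> Dict[str, Any]:
    """Run steps given in topological order and return every dataset."""
    if ray is None:
        results = dict(data)
        for step in steps:
            args = [results[name] for name in step.inputs]
            results[step.output] = step.func(*args)
        return results

    if not ray.is_initialized():
        ray.init(ignore_reinit_error=True)  # local instance for the sketch
    # Put the initial datasets into the object store.
    refs = {name: ray.put(value) for name, value in data.items()}
    for step in steps:
        args = [refs[name] for name in step.inputs]
        # Passing object refs as arguments lets Ray wait for upstream tasks,
        # so independent steps can run in parallel.
        refs[step.output] = ray.remote(step.func).remote(*args)
    return {name: ray.get(ref) for name, ref in refs.items()}
```

A sketch like this ignores the data catalog, hooks, and error handling, which is part of why leaving execution to an orchestrator may be the simpler route.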

Jul 21 '24 21:07 astrojuanlu

Just adding a comment to A) follow and B) mention that we may look into this in the coming months, mostly because doing batch embeddings in Spark with modern embedding models from Hugging Face is a pain.

Jul 25 '24 14:07 pascalwhoop
