
Document how users can use Ray with Kedro

Open crypdick opened this issue 4 years ago • 19 comments

Potential user here. I'm interested in using Kedro, but we use Ray Distributed instead of PySpark for our execution engine. Do your pipelines support this?

crypdick avatar Aug 07 '20 13:08 crypdick

Hi, thank you for your interest in Kedro! We don't natively support a Ray DataSet or a RayRunner so far, but you can add both as a custom dataset and a custom runner. See https://kedro.readthedocs.io/en/latest/06_nodes_and_pipelines/02_pipelines.html?#using-a-custom-runner for using a custom runner, and https://kedro.readthedocs.io/en/latest/07_extend_kedro/01_custom_datasets.html for custom datasets.

921kiyo avatar Aug 07 '20 13:08 921kiyo

Update: I confirmed that Ray works fine :)

crypdick avatar Aug 09 '20 17:08 crypdick

Just a quick note on this. @crypdick followed this tutorial from @dataengineerone, except that he used Ray instead of multiprocessing inside each node.

Thanks for investigating this @crypdick ✨

yetudada avatar Aug 14 '20 14:08 yetudada

Can we add official support for Ray?

Harsh-Maheshwari avatar Feb 08 '23 08:02 Harsh-Maheshwari

We happen to have documentation about Kedro on Dask https://kedro.readthedocs.io/en/stable/deployment/dask.html and apparently people have made Kedro on Ray work. Maybe we could reopen this issue and turn it into a documentation one? Happy to work on this in 6 to 8 weeks.

astrojuanlu avatar Feb 08 '23 10:02 astrojuanlu

@crypdick and @astrojuanlu Could I just add the @ray.remote decorator to my nodes and have everything work without any other changes? For Ray>1.5 we don't have to call ray.init() explicitly.

@yetudada The tutorial is very old and not a very scalable way of doing this.

I think we should be able to do this with Hooks and some other Kedro features. I am a beginner in both Ray and Kedro, so any help or guidance on an approach to solving this would be appreciated.

Harsh-Maheshwari avatar Apr 01 '23 20:04 Harsh-Maheshwari

I am not sure, @Harsh-Maheshwari. I stopped using Kedro years ago in favor of Metaflow.

crypdick avatar Apr 01 '23 20:04 crypdick

Reopening this as a documentation issue.

astrojuanlu avatar Sep 18 '23 16:09 astrojuanlu

We were asked about this again this week https://linen-slack.kedro.org/t/15736818/is-there-any-docs-that-explains-how-kedro-can-be-integrated-#70729160-84c6-4750-a59f-b3571e2e026b

Maybe time to write this up.

stichbury avatar Sep 29 '23 09:09 stichbury

Yesterday I met @IvanNardini and we discussed that it would be nice to bring this back to life at some point 😃

astrojuanlu avatar Jan 25 '24 09:01 astrojuanlu

@astrojuanlu @stichbury Ray would be a great addition to Kedro. I moved to a purely Ray code base because of difficulties in executing Kedro with Ray. Documentation exploring the integration between the two would help us a lot.

Harsh-Maheshwari avatar Jan 25 '24 13:01 Harsh-Maheshwari

Btw here's an early prototype from a hackathon https://github.com/kedro-org/kedro/pull/995

astrojuanlu avatar Feb 09 '24 09:02 astrojuanlu

@Harsh-Maheshwari Could you share the difficulties you had using Ray with Kedro? Which part of Ray are you using?

noklam avatar Feb 12 '24 14:02 noklam

Hi @noklam, sorry for the delayed response.

I am using Ray mostly for distributed computing. In the context of Kedro, we should be able to run a node across various workers. Kedro shouldn't have to manage the scheduling or the cluster side of things.

Let's say I have a partitioned dataset with 10_000 parquet files. I should be able to start a remote Ray cluster, connect my local Kedro project to that cluster, and schedule a pipeline to run on each worker. Each worker runs the same pipeline but on a different batch of parquet files, and the results are stored in a new partitioned dataset according to my catalog. All of the scheduling and work distribution should be managed by the Ray head node.
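
A minimal sketch of that fan-out, using only the Python standard library as a stand-in for Ray: the partition keys are split into batches and each batch is handed to a worker that runs the same function. All names here (`make_batches`, `process_batch`, `run_distributed`) are hypothetical and are not part of Kedro or Ray, and `ThreadPoolExecutor` is swapped in for a Ray cluster purely so the example runs anywhere. On Ray, `process_batch` would presumably become a `@ray.remote` task and the results would be collected with `ray.get`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List


def make_batches(keys: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of partition keys."""
    for start in range(0, len(keys), batch_size):
        yield keys[start:start + batch_size]


def process_batch(batch: List[str], transform: Callable[[str], str]) -> Dict[str, str]:
    """Hypothetical stand-in for running the same pipeline on one batch."""
    return {key: transform(key) for key in batch}


def run_distributed(
    keys: List[str],
    batch_size: int,
    transform: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Fan batches out to workers and merge the per-batch results."""
    results: Dict[str, str] = {}
    # Assumption: a thread pool replaces the Ray cluster for illustration only.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(process_batch, batch, transform)
            for batch in make_batches(keys, batch_size)
        ]
        for future in futures:
            results.update(future.result())
    return results


# Ten fake partition names instead of 10_000 real parquet files.
partition_keys = [f"part-{i:05d}.parquet" for i in range(10)]
outputs = run_distributed(partition_keys, batch_size=3, transform=str.upper)
```

Here `outputs` maps each input partition name to its processed result, which is roughly the shape a partitioned output dataset would need.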

A good-to-have feature would be resumability: if the system fails for any reason after starting, restart from where we left off.
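
One way to picture "restart from where we left off" is a small checkpoint file that records which partitions have already been processed. This is only a sketch under that assumption: `load_done`, `run_with_resume`, and the JSON checkpoint format are hypothetical and not a Kedro or Ray feature.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Set


def load_done(checkpoint: Path) -> Set[str]:
    """Return the set of partition keys recorded as finished."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def run_with_resume(
    keys: List[str], process: Callable[[str], None], checkpoint: Path
) -> List[str]:
    """Process keys in order, skipping any already in the checkpoint."""
    done = load_done(checkpoint)
    processed_now: List[str] = []
    for key in keys:
        if key in done:
            continue
        process(key)  # may raise; the checkpoint then keeps earlier progress
        done.add(key)
        checkpoint.write_text(json.dumps(sorted(done)))
        processed_now.append(key)
    return processed_now


calls: List[str] = []


def flaky(key: str) -> None:
    """Simulated work that fails once on partition 'c'."""
    if key == "c" and "c-failed" not in calls:
        calls.append("c-failed")
        raise RuntimeError("simulated worker failure")
    calls.append(key)


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "done.json"
    keys = ["a", "b", "c", "d"]
    try:
        run_with_resume(keys, flaky, ckpt)  # fails at 'c'
    except RuntimeError:
        pass
    resumed = run_with_resume(keys, flaky, ckpt)  # picks up at 'c'
```

The second call skips `a` and `b` because they are already in the checkpoint, so only `c` and `d` are processed again.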

Harsh-Maheshwari avatar Mar 21 '24 06:03 Harsh-Maheshwari

@Harsh-Maheshwari

"I am using Ray mostly for distributed computing, In the context of Kedro, We should be able to run a node across various workers"

https://github.com/kedro-org/kedro/issues/479#issuecomment-674116048 Why does this fail to solve your problem?

Is this specifically a problem with PartitionedDataset instead? https://github.com/kedro-org/kedro/issues/1413

I can see that the approach you suggest would work, but it's not clear to me why it is better.

noklam avatar Mar 28 '24 12:03 noklam

@noklam

I have just described the use case; right now I am not sure how to integrate Ray with Kedro.

So I don't know if this is the best or only way to do this.

What I can say is that if the integration between Kedro and Ray were a bit more native, we could, for example, even use different Ray autoscalers for different Kedro nodes.

Harsh-Maheshwari avatar Mar 28 '24 12:03 Harsh-Maheshwari

@Harsh-Maheshwari I am no expert in Ray, so I need an example to understand what's not working and how Kedro can make this easier.

Maybe the problem is that we just need a kedro-ray plugin and nothing should change in Kedro. I will leave it with someone more experienced with Ray.

noklam avatar Mar 28 '24 12:03 noklam

In https://github.com/astrojuanlu/workshop-from-zero-to-mlops I described how to execute Kedro pipelines in Ray using Prefect.

An alternative method would be creating a custom runner, but maybe it's good to leave that to an orchestrator instead?

astrojuanlu avatar Jul 21 '24 21:07 astrojuanlu

Just adding a comment to A) follow and B) mention that we may look into this in the coming months, mostly because doing batch embeddings in Spark with modern embedding models from Hugging Face is a pain.

pascalwhoop avatar Jul 25 '24 14:07 pascalwhoop