ray [Core] Ray-based concurrent.futures.Executor

Why are these changes needed?

This PR provides RayExecutor, a drop-in replacement for ProcessPoolExecutor and ThreadPoolExecutor from concurrent.futures, which executes tasks on a Ray cluster.

Related issue number

Closes #29456

Checks

[x] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
[x] I've run scripts/format.sh to lint the changes in this PR.
[x] I've included any doc changes needed for https://docs.ray.io/en/master/.
[x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- [x] Unit tests
- [x] Release tests
- [ ] This PR is not tested :(

Dec 01 '22 11:12 jeicher

The three failing tests don't seem to have anything to do with our contribution. The Docker Bootstrap test, at the very least, failed because Ubuntu failed to update.

Perhaps someone at Ray could shed some light on the failures?

Dec 01 '22 14:12 jeicher

sorry for the delay. I will review this tomorrow!

Dec 13 '22 15:12 rkooo567

@rkooo567 can you take another look at this ?

Jan 02 '23 13:01 nidabdella

Hmm actually before deep diving into the review, is it currently only support submitting 1 task? It looks like current subclasses of Executor (ProcessExecutor and ThreadPoolExecutor) both has a concept of pool (max_workers). Should we have the same thing?

Thanks @rkooo567! I've refactored quite a bit to make multiple task submission much clearer. The executor now has a max_workers argument which will execute tasks over an actor pool. If that kwarg is not specified, multiple tasks can still be submitted to the same RayExecutor instance, but scheduling will be managed entirely by ray (i.e. no fixed set of actor/workers).

Jan 04 '23 17:01 jeicher

thanks for the update and sorry for the delay! I am planning to review this tomorrow (have been off around new year!)

Jan 12 '23 14:01 rkooo567

btw there are lots of test failures. Do you mind merging the latest master?

Jan 12 '23 14:01 rkooo567

I made a first pass! Thanks for addressing the initial feedback.

Thanks @rkooo567! We'll get to these comments ASAP.

Jan 19 '23 13:01 jeicher

Thanks for your prompt replies @rkooo567. I'm busy with a little bit of a refactor again to remove ActorPools, as you suggested (and to clarify the object store issues). This shouldn't take me longer than a few days at most, but I'll ping you here when I've done all of that. 👍🏻

Jan 31 '23 06:01 jeicher

Sgtm! Let me know! I will follow up with the rest of comments when you get back to me :). Really excited for this contribution!

Feb 07 '23 09:02 rkooo567

Sgtm! Let me know! I will follow up with the rest of comments when you get back to me :). Really excited for this contribution!

I'm busy with this again, shouldn't be long now :D

Feb 14 '23 15:02 jeicher

@rkooo567 @jjyao I've simplified this PR quite a bit by removing actor pools, and I've responded to the majority of your queries above. It seems to me that the major outstanding issue is dealing with shutdowns.

@rkooo567, you suggested (https://github.com/ray-project/ray/pull/30826#discussion_r1089608649) that

if ray is already initialised we don't shut it down (on exiting the context), but
if it is not initialised we shut it down.

I'm not sure I understand the flow of logic there. Could you perhaps offer an example of how we get into either of those states with the executor?

@jjyao you suggested (https://github.com/ray-project/ray/pull/30826#discussion_r1090198139) that

we simply remove the ray shutdown functionality entirely, because
ray will automatically shutdown when the process ends, and
users could call ray.shutdown() if they really wanted to.

I think it would be ideal here to hide away any unnecessary ray interactions from the users. So, are there any situations that come to mind in which a user could get into trouble by not realising that there is a ray cluster continuing to run in the background? It feels more like a real drop-in replacement of the other Executor derivations to have no remaining artefacts after the computation finishes.

Thanks for all the reviews so far 😎

Feb 15 '23 14:02 jeicher

forgot about the PR for a while! I will take a look at it tomorrow!

Mar 15 '23 15:03 rkooo567

Hiyo, just wanted to motivate that this PR would be very much appreciated whenever you find yourself with some more free time!

Mar 28 '23 15:03 eddiebergman

Sorry I was a bit busy recently. I will review it asap. Sorry for the delay!

Apr 05 '23 13:04 rkooo567

@rkooo567 any update on this?

Apr 26 '23 09:04 nidabdella

Hi @nidabdella,

Could you reply to the comments I posted above? We can also schedule a meeting to go through the design and make sure we are aligned.

Apr 26 '23 16:04 jjyao

Hi @nidabdella,

Could you reply to the comments I posted above? We can also schedule a meeting to go through the design and make sure we are aligned.

Thanks @jjyao, I've reverted to an Actor-based model. In the meantime, I've pinged you on Slack. Let's try and organise a chat to make sure we're on the same page.

May 03 '23 15:05 jeicher

I am a bit occupied lately, so I will rely on @jjyao to finish the review!

May 10 '23 01:05 rkooo567

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

Jun 10 '23 04:06 stale[bot]

Hi @jeicher, do you have any updates on this?

Jun 14 '23 15:06 jjyao

Hi @jeicher, do you have any updates on this?

Hi @jjyao, thanks for checking. I am on leave at the moment so it will be a while before I get to this again.

Jun 14 '23 19:06 jeicher

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

Jul 15 '23 02:07 stale[bot]

Hey guys, I'll be back from leave soon. Hoping to get to this in the near future!

Jul 17 '23 13:07 jeicher

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

Aug 11 '23 21:08 stale[bot]

Just chiming in to say I love the idea :)

Nov 14 '23 13:11 fkaleo

@fkaleo I'll have a finished solution for this very soon 👍🏻

Nov 14 '23 14:11 jeicher

ray ray copied to clipboard

[Core] Ray-based concurrent.futures.Executor

Why are these changes needed?

Related issue number

Checks

ray
ray copied to clipboard