ray icon indicating copy to clipboard operation
ray copied to clipboard

[Core] Ray-based concurrent.futures.Executor

Open jeicher opened this issue 2 years ago • 21 comments

Why are these changes needed?

This PR provides RayExecutor, a drop-in replacement for ProcessPoolExecutor and ThreadPoolExecutor from concurrent.futures, which executes tasks on a Ray cluster.

Related issue number

Closes #29456

Checks

  • [x] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [x] Release tests
    • [ ] This PR is not tested :(

jeicher avatar Dec 01 '22 11:12 jeicher

The three failing tests don't seem to have anything to do with our contribution. The Docker Bootstrap test, at the very least, failed because Ubuntu failed to update.

Perhaps someone at Ray could shed some light on the failures?

jeicher avatar Dec 01 '22 14:12 jeicher

sorry for the delay. I will review this tomorrow!

rkooo567 avatar Dec 13 '22 15:12 rkooo567

@rkooo567 can you take another look at this ?

nidabdella avatar Jan 02 '23 13:01 nidabdella

Hmm actually before deep diving into the review, is it currently only support submitting 1 task? It looks like current subclasses of Executor (ProcessExecutor and ThreadPoolExecutor) both has a concept of pool (max_workers). Should we have the same thing?

Thanks @rkooo567! I've refactored quite a bit to make multiple task submission much clearer. The executor now has a max_workers argument which will execute tasks over an actor pool. If that kwarg is not specified, multiple tasks can still be submitted to the same RayExecutor instance, but scheduling will be managed entirely by ray (i.e. no fixed set of actor/workers).

jeicher avatar Jan 04 '23 17:01 jeicher

thanks for the update and sorry for the delay! I am planning to review this tomorrow (have been off around new year!)

rkooo567 avatar Jan 12 '23 14:01 rkooo567

btw there are lots of test failures. Do you mind merging the latest master?

rkooo567 avatar Jan 12 '23 14:01 rkooo567

I made a first pass! Thanks for addressing the initial feedback.

Thanks @rkooo567! We'll get to these comments ASAP.

jeicher avatar Jan 19 '23 13:01 jeicher

Thanks for your prompt replies @rkooo567. I'm busy with a little bit of a refactor again to remove ActorPools, as you suggested (and to clarify the object store issues). This shouldn't take me longer than a few days at most, but I'll ping you here when I've done all of that. 👍🏻

jeicher avatar Jan 31 '23 06:01 jeicher

Sgtm! Let me know! I will follow up with the rest of comments when you get back to me :). Really excited for this contribution!

rkooo567 avatar Feb 07 '23 09:02 rkooo567

Sgtm! Let me know! I will follow up with the rest of comments when you get back to me :). Really excited for this contribution!

I'm busy with this again, shouldn't be long now :D

jeicher avatar Feb 14 '23 15:02 jeicher

@rkooo567 @jjyao I've simplified this PR quite a bit by removing actor pools, and I've responded to the majority of your queries above. It seems to me that the major outstanding issue is dealing with shutdowns.

@rkooo567, you suggested (https://github.com/ray-project/ray/pull/30826#discussion_r1089608649) that

  • if ray is already initialised we don't shut it down (on exiting the context), but
  • if it is not initialised we shut it down.

I'm not sure I understand the flow of logic there. Could you perhaps offer an example of how we get into either of those states with the executor?

@jjyao you suggested (https://github.com/ray-project/ray/pull/30826#discussion_r1090198139) that

  • we simply remove the ray shutdown functionality entirely, because
  • ray will automatically shutdown when the process ends, and
  • users could call ray.shutdown() if they really wanted to.

I think it would be ideal here to hide away any unnecessary ray interactions from the users. So, are there any situations that come to mind in which a user could get into trouble by not realising that there is a ray cluster continuing to run in the background? It feels more like a real drop-in replacement of the other Executor derivations to have no remaining artefacts after the computation finishes.

Thanks for all the reviews so far 😎

jeicher avatar Feb 15 '23 14:02 jeicher

forgot about the PR for a while! I will take a look at it tomorrow!

rkooo567 avatar Mar 15 '23 15:03 rkooo567

Hiyo, just wanted to motivate that this PR would be very much appreciated whenever you find yourself with some more free time!

eddiebergman avatar Mar 28 '23 15:03 eddiebergman

Sorry I was a bit busy recently. I will review it asap. Sorry for the delay!

rkooo567 avatar Apr 05 '23 13:04 rkooo567

@rkooo567 any update on this?

nidabdella avatar Apr 26 '23 09:04 nidabdella

Hi @nidabdella,

Could you reply to the comments I posted above? We can also schedule a meeting to go through the design and make sure we are aligned.

jjyao avatar Apr 26 '23 16:04 jjyao

Hi @nidabdella,

Could you reply to the comments I posted above? We can also schedule a meeting to go through the design and make sure we are aligned.

Thanks @jjyao, I've reverted to an Actor-based model. In the meantime, I've pinged you on Slack. Let's try and organise a chat to make sure we're on the same page.

jeicher avatar May 03 '23 15:05 jeicher

I am a bit occupied lately, so I will rely on @jjyao to finish the review!

rkooo567 avatar May 10 '23 01:05 rkooo567

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Jun 10 '23 04:06 stale[bot]

Hi @jeicher, do you have any updates on this?

jjyao avatar Jun 14 '23 15:06 jjyao

Hi @jeicher, do you have any updates on this?

Hi @jjyao, thanks for checking. I am on leave at the moment so it will be a while before I get to this again.

jeicher avatar Jun 14 '23 19:06 jeicher

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Jul 15 '23 02:07 stale[bot]

Hey guys, I'll be back from leave soon. Hoping to get to this in the near future!

jeicher avatar Jul 17 '23 13:07 jeicher

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

stale[bot] avatar Aug 11 '23 21:08 stale[bot]

Just chiming in to say I love the idea :)

fkaleo avatar Nov 14 '23 13:11 fkaleo

@fkaleo I'll have a finished solution for this very soon 👍🏻

jeicher avatar Nov 14 '23 14:11 jeicher