ray icon indicating copy to clipboard operation
ray copied to clipboard

[runtime env] Fix Ray hangs when nonexistent conda environment is specified #28105

Open rkooo567 opened this issue 1 year ago • 6 comments

Why are these changes needed?

When a conda name is given to the runtime env, we assume the env already exits. However, there are times the env doesn't exist, and if that happens, it hangs forever. This fixes the issue by always checking conda env list before creating a conda runtime env.

Related issue number

Closes https://github.com/ray-project/ray/issues/28105

Checks

  • [ ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

rkooo567 avatar May 02 '23 15:05 rkooo567

Could you fix the PR title?

jjyao avatar May 02 '23 17:05 jjyao

By the way, do you happen to know the impact of this on worker startup time? My impression is that conda activate already takes a couple seconds, and this check might take several more seconds depending on how many envs there are, but this performance impact still seems worth the correctness tradeoff.

Hmm actually a good point. Maybe I can make it do only once? Maybe we can create a URI that has the state about it.

rkooo567 avatar May 02 '23 22:05 rkooo567

I think despite the additional complexity, your URI approach is the correct solution. That's what we do for pip currently. I'm fine with leaving it as a followup issue, unless you think the performance regression in this PR is unacceptable.

Note that for production use cases, we don't recommend runtime_env anyway (we recommend using a container image). If the user must use an existing conda environment and they don't need multiple environments, we can recommend they call conda activate before ray start.

architkulkarni avatar May 02 '23 23:05 architkulkarni

Yeah I think it is a perf regression if we merge it this way. Let me see how complicated it is to support the URI.

rkooo567 avatar May 03 '23 05:05 rkooo567

@architkulkarni can you check it again? I addressed the problem ^. I think testing is a bit tough (do we have a test that already has conda env?), so I manually verified it, but lmk if you have an idea for testing.

rkooo567 avatar May 03 '23 06:05 rkooo567

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Jun 10 '23 05:06 stale[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale[bot] avatar Aug 10 '23 05:08 stale[bot]

Failed tests seem unrelated

rkooo567 avatar Aug 23 '23 13:08 rkooo567