ray
ray copied to clipboard
[runtime env] Fix Ray hangs when nonexistent conda environment is specified #28105
Why are these changes needed?
When a conda name is given to the runtime env, we assume the env already exits. However, there are times the env doesn't exist, and if that happens, it hangs forever. This fixes the issue by always checking conda env list before creating a conda runtime env.
Related issue number
Closes https://github.com/ray-project/ray/issues/28105
Checks
- [ ] I've signed off every commit(by using the -s flag, i.e.,
git commit -s
) in this PR. - [ ] I've run
scripts/format.sh
to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
doc/source/tune/api/
under the corresponding.rst
file.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
Could you fix the PR title?
By the way, do you happen to know the impact of this on worker startup time? My impression is that conda activate already takes a couple seconds, and this check might take several more seconds depending on how many envs there are, but this performance impact still seems worth the correctness tradeoff.
Hmm actually a good point. Maybe I can make it do only once? Maybe we can create a URI that has the state about it.
I think despite the additional complexity, your URI approach is the correct solution. That's what we do for pip
currently. I'm fine with leaving it as a followup issue, unless you think the performance regression in this PR is unacceptable.
Note that for production use cases, we don't recommend runtime_env
anyway (we recommend using a container image). If the user must use an existing conda environment and they don't need multiple environments, we can recommend they call conda activate
before ray start
.
Yeah I think it is a perf regression if we merge it this way. Let me see how complicated it is to support the URI.
@architkulkarni can you check it again? I addressed the problem ^. I think testing is a bit tough (do we have a test that already has conda env?), so I manually verified it, but lmk if you have an idea for testing.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
- If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Failed tests seem unrelated