[Doc] add troubleshooting info for ray client

Open scottsun94 opened this issue 1 year ago • 7 comments

Why are these changes needed?

https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1686672067902799

Cover the error message an OSS user ran into.

Related issue number

Checks

  • [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

scottsun94 avatar Jun 13 '23 22:06 scottsun94

Meta point, why does this file have 6 code owners (and my approval doesn't actually do anything)? I'd be happy to make docs PRs myself as things come up, but this seems like a lot of hoops to jump through.

ckw017 avatar Jun 13 '23 22:06 ckw017

Do you want to include any possible actions to fix the situation?

angelinalg avatar Jun 14 '23 00:06 angelinalg

Do you want to include any possible actions to fix the situation?

I think at the very least it would make sense if I could approve PRs for changes to Ray Client docs in such a way that they can be merged, but that doesn't actually fix the problem because code owners can't approve their own PRs. So in that sense, there probably isn't a way that would fix this problem in the current state bar giving out force merge permissions.

ckw017 avatar Jun 14 '23 01:06 ckw017

^Now that I look at it, I'm not sure if that suggestion was for the code owner thing or for fixing the reconnect issue. For the reconnect issue, there's also not much that can be done about it, since the head node is already lost by that point.

ckw017 avatar Jun 14 '23 01:06 ckw017

^Now that I look at it, I'm not sure if that suggestion was for the code owner thing or for fixing the reconnect issue. For the reconnect issue, there's also not much that can be done about it, since the head node is already lost by that point.

RE the reconnect issue, why did the user say "But after re-creating the head pod the its back to normal?" Basically, does the user need to re-create the head node or head pod and use Ray Client to connect to it again?

scottsun94 avatar Jun 14 '23 05:06 scottsun94

I think at the very least it would make sense if I could approve PRs for changes to Ray Client docs in such a way that they can be merged, but that doesn't actually fix the problem because code owners can't approve their own PRs.

RE this issue, @pcmoritz, what's the best way to handle this? It makes sense that Chris should be able to give real approval to Ray Client-related PRs, because only he reviews the code and docs here.

scottsun94 avatar Jun 14 '23 05:06 scottsun94

RE the reconnect issue, why did the user say "But after re-creating the head pod the its back to normal?" Basically, does the user need to re-create the head node or head pod and use Ray Client to connect to it again?

I'm guessing they just started a new Ray client session (which doesn't rely on state from the previous head pod), which is why it worked.

ckw017 avatar Jun 14 '23 17:06 ckw017
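
For readers landing here from the error message, the "start a new Ray client session" recovery described above might look roughly like the sketch below. It is not taken from the PR itself. The helper names start_fresh_session and connect_with_ray are hypothetical, and whether tearing down the stale session raises after the head node is lost is an assumption, so the sketch simply tolerates a failure there.

```python
from typing import Any, Callable


def start_fresh_session(
    connect: Callable[[], Any],
    disconnect: Callable[[], None],
) -> Any:
    """Drop the stale client session, then open a brand-new one.

    The new session does not reuse any state from the lost head node.
    """
    try:
        disconnect()
    except Exception:
        # Assumption: the old head node is gone, so cleanup may itself fail.
        # That is fine here because the old session is being discarded anyway.
        pass
    return connect()


def connect_with_ray(address: str) -> Any:
    """Hypothetical wrapper that wires the helper to Ray Client."""
    import ray  # imported lazily so the helper above has no Ray dependency

    return start_fresh_session(
        connect=lambda: ray.init(address=address),
        disconnect=ray.shutdown,
    )
```

Usage would be something like connect_with_ray("ray://<head_node_host>:10001") once the head node or head pod has been re-created, where the host is a placeholder for your own cluster address. Objects and actors created through the old session are not recovered; any work that depended on them has to be resubmitted.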