[core] Add opt-in flag for Windows and OSX clusters, update `ray start` output to match docs
Why are these changes needed?
This PR cleans up a few usability issues around Ray clusters:
- Cleans up the `ray start` log output to match the new documentation on Ray clusters. Mainly, it de-emphasizes Ray Client and recommends jobs instead.
- Adds an opt-in flag for enabling multi-node clusters on OSX and Windows. Previously, it was possible to start a multi-node cluster, but any Ray program would then fail mysteriously after connecting to the cluster. Now, Ray prints a warning if the opt-in flag is not set.
- Document multi-node support for OSX and Windows.
`ray start --head` output before this PR:
Local node IP: 10.103.212.102
--------------------
Ray runtime started.
--------------------
Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='10.103.212.102:6379'
  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto')
  To connect to this Ray runtime from outside of the cluster, for example to
  connect to a remote cluster from your laptop directly, use the following
  Python code:
    import ray
    ray.init(address='ray://<head_node_ip_address>:10001')
  To see the status of the cluster, use
    ray status
  To monitor and debug Ray, view the dashboard at
    127.0.0.1:8265
  If connection fails, check your firewall settings and network configuration.
  To terminate the Ray runtime, run
    ray stop
After:
Next steps
  To add another node to this Ray cluster, run
    ray start --address='10.103.212.102:6379'
  
  To connect to this Ray cluster, run `ray.init()` as usual:
    import ray
    ray.init()
  
  To connect to this Ray instance from outside of the cluster, for example 
  when connecting to a remote cluster from your laptop, make sure the
  dashboard (127.0.0.1:8265) is accessible and use Ray jobs. For example:
    RAY_ADDRESS='http://<dashboard URL>' ray job submit --working-dir . -- python my_script.py
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
  for more information on connecting to the Ray cluster from a remote client.
  
  To see the status of the cluster, use
    ray status
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8265
  
  If connection fails, check your firewall settings and network configuration.
  
  To terminate the Ray runtime, run
    ray stop
On OSX or Windows, when `RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER` is not set to `true`:
$ RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=false ray start --head
Local node IP: 127.0.0.1
--------------------
Ray runtime started.
--------------------
Next steps
  Ray clusters are not supported on OSX and Windows.
  If you would like to proceed anyway, restart Ray with:
    ray stop
    RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=true ray start
  
  `RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=true` must also be passed to any Ray clients.
  
  To terminate the Ray runtime, run
    ray stop
$ RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=false python -c "import ray; ray.init()"
2022-12-16 15:41:50,268 INFO worker.py:1356 -- Connecting to existing Ray cluster at address: 127.0.0.1:6379...
2022-12-16 15:43:12,541 WARNING worker.py:1359 -- Ray clusters are not supported on OSX and Windows. If you would like to proceed anyway, rerun with the environment variable `RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=true`.
2022-12-16 15:41:50,273 INFO worker.py:1545 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
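The opt-in behavior shown in the output above can be sketched roughly as follows. This is an illustrative sketch, not Ray's actual implementation: the helper name `multi_node_warning` and the platform check via `sys.platform` are assumptions. Only the environment variable name and the warning text are taken from the output above.

```python
import os
import sys

FLAG = "RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER"

# Warning text copied from the log output in this PR description.
WARNING = (
    "Ray clusters are not supported on OSX and Windows. If you would like to "
    "proceed anyway, rerun with the environment variable "
    "`RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER=true`."
)


def multi_node_warning(platform=None, environ=None):
    """Hypothetical helper: return the warning text if the opt-in flag is
    required but not set to "true", otherwise return None."""
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    # sys.platform is "darwin" on macOS and "win32" on Windows.
    unsupported = platform in ("darwin", "win32")
    opted_in = environ.get(FLAG, "false").lower() == "true"
    return WARNING if unsupported and not opted_in else None
```

Passing `platform` and `environ` explicitly keeps the check easy to test on any machine; calling it with no arguments would use the current platform and environment.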
Related issue number
Closes #30770.
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
 
Thanks for doing this ❤️
Just a small nit: At the moment we have an unholy mix of sometimes 1 being true and sometimes "true" being true for environment variables that are flags. It would be good to clean that up going forward (maybe the only way is to let both 0 and false mean false, and both 1 and true mean true, for the ones that need cleaning up, so we stay backwards compatible).
At the moment the 0 / 1 convention seems to be more common (see https://docs.ray.io/en/latest/tune/api_docs/env.html and the other variables in the ray_constants.py file). Should we try to standardize around that for new environment variables?
The 0 / 1 convention feels a little nicer since there is no need to decide between "True" and "true" (I also feel like it is the more common convention, but I'm not sure about that).
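A backwards-compatible parser along the lines suggested here might look like the sketch below. The helper name `env_flag` is hypothetical and is not claimed to exist in ray_constants.py; the accepted spellings (0, 1, true, false, case-insensitive) are an assumption based on this comment.

```python
import os

_TRUE_VALUES = {"1", "true"}
_FALSE_VALUES = {"0", "false"}


def env_flag(name, default=False, environ=None):
    """Hypothetical helper: read a boolean flag from the environment,
    accepting both the 0/1 and the true/false conventions."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()  # "True", "TRUE" and "true" are treated alike
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    # Reject anything else loudly instead of silently guessing.
    raise ValueError(f"{name} must be one of 0, 1, true, false; got {raw!r}")
```

Raising on unknown values is a design choice in this sketch; falling back to the default would be more lenient but could hide typos.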
> Add an opt-in flag for enabling multi-node clusters for OSX and Windows

Is there a good reason to document this flag? It seems preferable to raise an exception and just say we do not support this.

> > Add an opt-in flag for enabling multi-node clusters for OSX and Windows
>
> Is there a good reason to document this flag? It seems preferable to raise an exception and just say we do not support this.

People should be allowed to live dangerously (with a warning of course).
Also.. possibly someone could come along and help make this work for OSX / Windows at some point?

> > > Add an opt-in flag for enabling multi-node clusters for OSX and Windows
> >
> > Is there a good reason to document this flag? It seems preferable to raise an exception and just say we do not support this.
>
> People should be allowed to live dangerously (with a warning of course).
> Also.. possibly someone could come along and help make this work for OSX / Windows at some point?

Yes, this is the reason. We've also had at least two users ask about this on discuss.ray.io, and it seems their only real blocker is #30770.
After reading the messages again, I think there is some potential for confusion about what a cluster is. Could we clarify the message to say "Multi-node Ray clusters"?
Also:
1. I don't think we should print any warning on `ray.init()`. This is spammy and probably not actionable if your cluster is already started.
Hmm, the problem with this one is that the flag needs to be set on both the cluster and the driver. But what we can actually do here is auto-set the flag on the driver based on whether we are connecting to an existing Ray cluster. I think the original "spammy firewall messages" case is only relevant for `ray.init()` without an existing cluster.
2. I think we should raise an error when trying to start a worker node on OSX/Windows without the flag set.
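The two proposals above (auto-setting the flag on a driver that connects to an existing cluster, and raising an error when a worker node is started without the flag) could be sketched as follows. Both function names are hypothetical and this is not the code in this PR; the platform strings assume the values reported by Python's `sys.platform`.

```python
FLAG = "RAY_ENABLE_WINDOWS_OR_OSX_CLUSTER"
UNSUPPORTED_PLATFORMS = ("darwin", "win32")  # macOS and Windows


def driver_env(connecting_to_existing_cluster, platform, environ):
    """Hypothetical: return a copy of `environ` with the flag auto-set when
    a driver on OSX/Windows connects to an already running cluster."""
    env = dict(environ)
    if connecting_to_existing_cluster and platform in UNSUPPORTED_PLATFORMS:
        # setdefault keeps an explicit user choice (e.g. "false") untouched.
        env.setdefault(FLAG, "true")
    return env


def check_worker_node_start(platform, environ):
    """Hypothetical: refuse to start a worker node on OSX/Windows unless
    the opt-in flag is set to "true"."""
    opted_in = environ.get(FLAG, "").lower() == "true"
    if platform in UNSUPPORTED_PLATFORMS and not opted_in:
        raise RuntimeError(
            "Multi-node Ray clusters are not supported on OSX and Windows. "
            f"Set {FLAG}=true to proceed anyway."
        )
```

With this split, a plain `ray.init()` against a running cluster would not need a warning, while the hard failure happens at the point where the user can still act on it.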
Seems like there's a test_cli failure.
I think this breaks master https://github.com/ray-project/ray/issues/32389