Ray IPv6 support
Why are these changes needed?
Provides support for IPv6 only and dual-stack IPv4/IPv6.
Inspired by this discussion https://discuss.ray.io/t/lack-of-ipv6-support/2323
IPv6 is much more common than it used to be; in some cases the only available stack is IPv6. Commonly machines come with dual-stack connectivity enabled (IPv4 and IPv6 available at the same time).
I've already reviewed the source code of kuberay. I've determined due to its use of native Go net methods kuberay already supports IPv6.
Design
- See
python/ray/_private/net.py. Centralize networking code to make it easier to support IPv6 alongside IPv4. This is simpler and clarifies where in your network code you need to support this when using low level python sockets. - Favor IPv4 since that is what this project is typically used to; code will fall back to IPv6 if IPv4 is not available.
- It uses dual-stack IPv4 and IPv6 when it is available; for example when listening on a port across all interfaces or determining free ports.
- All places I could find for string parsing IP or host and port have been substituted. With a central method, updates can be made from which all dependent areas will benefit.
Shell code
I notice the following scripts may or may not work with IPv6. They should probably updated using bash parameter expansion.
doc/source/cluster/doc_code/slurm-basic.shdoc/source/cluster/doc_code/slurm-template.sh
If I update them; it will be in a separate PR.
For example, the following bash code properly parses both IPv4 and IPv6 ip:port into two parts.
# IPv6 loopback port 8080
ip_and_port=::1:8080
ip="${ip_and_port%:*}"
port="${ip_and_port##*:}"
# IPv4 loopback port 8080
ip_and_port=127.0.0.1:8080
ip="${ip_and_port%:*}"
port="${ip_and_port##*:}"
Related issue number
I don't know if it will close all of these issues but I did an issues search for ipv6.
- closes #6967
- #43416
- #41966
- #43910
Duplicated by
- https://github.com/ray-project/ray/pull/40332
Checks
Initially, this pull request is in-draft. This is to enable reviewing sooner and manual testing to occur. Automated tests to follow.
- [x] I've signed off every commit(by using the -s flag, i.e.,
git commit -s) in this PR. - [ ] I've run
scripts/format.shto lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in
doc/source/tune/api/under the corresponding.rstfile. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [x] This PR is not tested :(
I guess my issues with staticmethod are best described here https://stackoverflow.com/questions/1859959/static-methods-how-to-call-a-method-from-another-method
After messing around with @staticmethod across different versions of python; the way I used it (on a file of functions without a class definition) worked in Python 3.10 but not Python 3.9. Rather than trying to declare the methods static which would work across multiple versions of python I decided to remove the symbol from the private net module.
This is ready for general testing. All of the automated tests pass except for linting and I somehow need to account for unittest.mock.MagicMock type being passed to the IP address parser.
This is pretty close to leaving draft. I need to write some automated tests.
Pinging @rkooo567 @jjacobj84 for visibility. I appear to have duplicated some effort and I plan to keep iterating on this PR until it is ready.
From the failure
[2024-03-25T03:13:33Z] E ValueError: address type not str or bytes:
[2024-03-25T03:13:33Z] E
[2024-03-25T03:13:33Z] E type: <class 'unittest.mock.MagicMock'>
[2024-03-25T03:13:33Z] E value: <MagicMock name='mock.gcs_address' id='[140017679483136](tel:140017679483136)'>
I don’t want to handle mock classes in application code. How should I handle this?
@samrocketman - we'll look for opp to pair up someone here to help this through; stay tuned next Monday~
@anyscalesam sounds good. I am available to continue working on this just holding per my questions; I know there’s code style issues but I’ll fix them before taking out of draft.
I posted in https://github.com/ray-project/ray/pull/40332 asking author if they mind if I pull in their C/C++ changes. If they don’t answer then I’ll update these files myself with my own take rather than adopting theirs.
been a week @samrocketman > I think safe to pick; just include the OG author and/or reference it in your PR? Let's not wait. My two cents (feel free to disagree)
@samrocketman not pushing but any ETA on when you think you'll open this up for review?
@samrocketman not pushing but any ETA on when you think you'll open this up for review?
I've been sick the past week so likely a week at least; I'm not available over the weekend and I only develop OSS in my spare time. If someone on your team wants to push to my branch I'm okay with that as well.
- The python code is pretty much done. There's an edge case with a mock interfering with my strict type checking; I'll likely have to loosen that check for the mock tests unfortunately.
- The C/C++ code still needs to be updated for gRPC.
Once those two items are addressed it's ready for testing on an actual IPv6 environment.
@anyscalesam worth calling out I've not operated Ray before. I've been making code changes based on core networking and programming concepts/experience.
So when this is ready/out of draft any help in testing this in an ipv6 environment would be appreciated.
The ability to run in an ipv6 only network is also a blocker for us. I would be happy to try and run some tests as well so we can get this merged.
Friendly ping here :)
So I rebased this PR onto master, built the Docker image, and deployed it onto an IPv6-only Kuberentes cluster via KubeRay. This cleared several errors in the Python code and allowed GCS to start... but alas, there still appears to be more IPv4-related tech debt lurking within the CPP source, as revealed by GCS logs when trying to connect to Redis:
2024-08-28 08:09:47,248 I 75 75] (gcs_server) gcs_server.cc:518: Using external Redis for KV storage: redis:6379
[2024-08-28 08:09:47,278 I 75 75] (gcs_server) gcs_server.cc:73: GCS storage type is StorageType::REDIS_PERSIST
[2024-08-28 08:09:47,279 I 75 75] (gcs_server) redis_context.cc:474: Resolve Redis address to fdf5:6da1:fe0d:cc1e::5a9c
[2024-08-28 08:09:47,279 I 75 75] (gcs_server) redis_context.cc:350: Attempting to connect to address fdf5:6da1:fe0d:cc1e::5a9c:6379.
As best as I can determine, the root cause appears to be traceable to RedisAsioClient.
Searching the source code for occurrences of the string "v4" also turns up network_util.cc - specifically, problematic code in CheckFree , which seems likely to break the Raylet WorkerPool. This problem also appears in a few tests.
Searching the source code for occurrences of the strings "127.0.0.1" and "0.0.0.0" also yields several hits - mostly in various tests, but also in (e.g.) GrpcServer, ObjectManager, and NodeManager.
These bugs are blockers to operating Ray within IPv6-only environments.
thanks for doing a pass here @csingley; we'll take a look and get back to you on next steps here
this also becomes a blocker for us after we switched to an ipv6-only k8s environment.
TY for the cont'd feedback; keep it coming. It helps us with the right data signal to prioritize right projects.
I'll look at picking this up again. I made these changes in concept only (I don't currently operate ray but my environment does have ipv6 only clusters).
I can help to test out if we have a testing image for arm64. Thanks for your effort.
@anyscalesam We at Cloudflare would also benefit from IPv6 support.
@xtuc @aspexdaniel @samrocketman can y'all reach out to me on ray.slack.com so I can invite you to the Ray Contributors sync to get some more traction on this?
I can connect you with the Ray Core on-call so we can get a plan of attack together to get this pushed through. Pretty good improvement for the community so I feel well worth doing.
... can y'all reach out to me on ray.slack.com so I can invite you to the Ray Contributors sync to get some more traction on this?
It says I do not have an account and I don't see signup links on the ray website.
I think it makes most sense to do this in two parts:
- These supporting python changes which DRY up socket usage. The intent here is the code was reorganized a little bit and we want to make sure that existing ray functionality works well.
- c++ changes is the last bit of change for IPv6 support. I'm comfortable making the cpp changes but it might make most sense to do it as a separate PR once it is verified the python changes are acceptable.
@xtuc @aspexdaniel @samrocketman can y'all reach out to me on ray.slack.com so I can invite you to the Ray Contributors sync to get some more traction on this?
I can connect you with the Ray Core on-call so we can get a plan of attack together to get this pushed through. Pretty good improvement for the community so I feel well worth doing.
me too can't find a link to join the slack group.
I think it makes most sense to do this in two parts:
- These supporting python changes which DRY up socket usage. The intent here is the code was reorganized a little bit and we want to make sure that existing ray functionality works well.
- c++ changes is the last bit of change for IPv6 support. I'm comfortable making the cpp changes but it might make most sense to do it as a separate PR once it is verified the python changes are acceptable.
as the python codes relies on the c++ changes, will it be possible to submit and test the cpp changes first?
as the python codes relies on the c++ changes, will it be possible to submit and test the cpp changes first?
My intent was to basically DRY up the IPv4/IPv6 network code (without full support for IPv6). The goal being get it nicely organized and ensure I didn't break existing functionality. Then a followup change with the cpp changes with any remaining python changes.
I'm going to take a deeper look at this today I received permission from my place of work to take a look at this during working hours.
Because I have approval to work on this during work hours I'll aim for full IPv6 support in this PR rather than breaking it up over multiple PRs. My primary blocker up to this point is lack of time in off-hours to work on it (I have several projects I contribute to so it's always a bit of a dice roll which ones I update).
@anyscalesam if you want to invite me to your ray slack instance you can find my email in my commit authorship of this PR
We're interested in this support @ Netflix as well (we're also going the way of IPv6-only for k8s). Would be happy to engage on slack if we could be of any help