Preserve worker hostname
This PR replaces the stalled #4111 (@bbockelm, since your PR has conflicts and was no longer easy to rebase on master, I took the patch directly from our distributed.git fork). We (cc: @bbockelm @nsmith-) have been using this patch at our analysis facility for the last half year, but it would be nice to have this issue resolved in master...
Original PR description from #4111:
This provides a configuration option allowing the scheduler to preserve worker hostnames (as opposed to the current and default behavior of resolving hostnames immediately to IP addresses). Preserving the hostname should convey advantages in two situations (sketches follow the list):
- Dual-network setups where some hosts are IPv4-only or IPv6-only. This allows the host that wants to communicate with the worker to determine the best network to route over (as opposed to the scheduler making that choice).
- Clusters that route TLS connections based on SNI. If the hostname is kept internally, it will be passed to OpenSSL when making a connection. This allows the OpenSSL client to embed the desired hostname in an SNI header in the TLS handshake. If the worker is behind a TLS router (not entirely uncommon in a Kubernetes setup), then this SNI is required to successfully route the request (illustrated below).
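As a hedged illustration of the SNI point above (the hostname and port are hypothetical, not part of this PR): Python's `ssl` module only places an SNI value in the TLS handshake when it is given a hostname via `server_hostname`, so connecting to a raw IP leaves a TLS router with nothing to route on.

```python
# Illustration only: an SNI value is sent when wrap_socket() receives a
# hostname; connecting by raw IP would leave the SNI extension empty.
# "worker-1.example.org" is a hypothetical worker address.
import socket
import ssl

ctx = ssl.create_default_context()
with socket.create_connection(("worker-1.example.org", 443)) as sock:
    # server_hostname is placed into the ClientHello's SNI extension and is
    # also used for certificate hostname verification.
    with ctx.wrap_socket(sock, server_hostname="worker-1.example.org") as tls:
        print(tls.version())
```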
Closes #4070
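For reference, a minimal sketch of how the option might be toggled, assuming (per the discussion further down in this thread) that the config key is `distributed.scheduler.resolve-worker-hostname` and that `True` keeps the existing resolve-to-IP behavior while `False` preserves hostnames; the key name and semantics may differ in the merged version.

```python
# Sketch, not the authoritative API: key name and meaning are taken from
# this thread and may differ in the merged version.
import dask
from dask.distributed import Client

# False -> preserve worker hostnames instead of resolving them to IPs.
dask.config.set({"distributed.scheduler.resolve-worker-hostname": False})

client = Client()  # a scheduler started after this point sees the override
```

The same value could instead be set centrally by nesting the key under the `distributed.scheduler` section of `distributed.yaml`.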
@oshadura The build is failing due to some linting issue, in particular a formatting issue since we're using black for automatic code formatting. I can recommend installing pre-commit for this.
I like this change from a debugging POV alone, since hostnames are easier to read, but I don't have a good feeling about what the default should be.
From the description of the issue, I have the impression that, functionality-wise, off by default is sensible, but that comes with certain performance penalties which may not be obvious to users. However, with DNS caches I'm not sure how severe these penalties actually are (see the timing sketch below).
@jacobtomlinson you've been looking into various deployment scenarios recently. Do you have an idea what the proper choice should be here?
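One way to get a feel for the penalty being discussed (purely a measurement sketch; the hostname is a placeholder, not something from this PR) is to time repeated `socket.getaddrinfo` calls, since a warm resolver cache typically makes lookups after the first one much cheaper:

```python
# Rough measurement of DNS resolution cost; "some-worker.example.org" is a
# placeholder hostname, substitute one from your own cluster.
import socket
import time

host = "some-worker.example.org"
for attempt in range(3):
    start = time.perf_counter()
    try:
        socket.getaddrinfo(host, 8786)
    except socket.gaierror as exc:
        print(f"lookup failed: {exc}")
        break
    elapsed = time.perf_counter() - start
    print(f"lookup {attempt + 1}: {elapsed * 1000:.2f} ms")
```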
I think that when we looked into this before it was ok/preferable to store hostnames rather than IP addresses. I'm surprised that that was not implemented. I agree that Jacob would be a good person to ask here.
This seems a reasonable change to me. I would be curious to know the original intention behind resolving IPs upfront, though. It is an unusual thing to do. My guess is that this was some requirement for some HPC system in the early days of Dask? @mrocklin do you know the reasoning?
Originally I think I may have run into systems where DNS lookups took a while. My understanding today is that this is rare.
I thought that we had already fixed this. Fixing it sounds ok to me.
I retriggered the tests to see if they are merely flaky or if there is something off. It appears we all agree, so we can merge once this is green-ish.
It looks like some of the failures are genuine. Lots of red X's here. Looking briefly at one report, I see that this test fails:
FAILED distributed/tests/test_scheduler.py::test_coerce_address - assert 'tcp...
Which seems like it might be related. Possibly the test is now wrong.
Thanks @mrocklin, I checked locally and you are right! I will push a fix.
@mrocklin could I please ask for a suggestion about what I am doing wrong? With `distributed.scheduler.resolve-worker-hostname: True` the failing assert should be fine, but even with the next line (https://github.com/dask/distributed/pull/4938/files#diff-d4f4e24d830a24ab478c8cdce0a8da2357f57d7dbc08a7181c79328db65dbdffR598) it still fails locally for me. If I instead switch it centrally in distributed.yaml, then it works and the test passes.
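If it helps pin down the difference between the two approaches: a per-test override is usually passed through `gen_cluster(config=...)` (or a `dask.config.set` context entered before the cluster starts), because values read from `distributed.yaml` are fixed when the scheduler is created. A sketch, assuming the config key name and test behavior discussed in this thread:

```python
# Sketch of a per-test override; the config key and its effect on
# test_coerce_address are assumptions based on this thread.
from distributed.utils_test import gen_cluster


@gen_cluster(config={"distributed.scheduler.resolve-worker-hostname": True})
async def test_coerce_address_resolves_to_ip(s, a, b):
    # With resolution re-enabled, coerced addresses should be IP-based again,
    # so the original assertions in test_coerce_address should still hold.
    assert s.coerce_address("127.0.0.1:8000") == "tcp://127.0.0.1:8000"
```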
Wondering what the status of this PR is? It seemed very close and then went by the wayside.
Can one of the admins verify this patch?