Preserve worker hostname
This PR replaces the stalled #4111 (@bbockelm, since your PR has conflicts and was no longer easy to rebase on master, I took the patch directly from our distributed.git fork). We (cc: @bbockelm @nsmith-) have been using this patch at our analysis facility for the last half year, but it would be nice to have this issue resolved in master...
Original PR description from #4111:
This provides a configuration option allowing the scheduler to preserve worker hostnames (as opposed to the current and default behavior of resolving hostnames immediately to IP addresses). Preserving the hostname should convey advantages in two situations (sketches follow the list):
- Dual-network setups where some hosts are IPv4-only or IPv6-only. This allows the host that wants to communicate with the worker to determine the best network to route over (as opposed to the scheduler making that choice).
- Clusters that route TLS connections based on SNI. If the hostname is kept internally, it will be passed to OpenSSL when making a connection. This allows the OpenSSL client to embed the desired hostname in an SNI header in the TLS handshake. If the worker is behind a TLS router (not entirely uncommon in a Kubernetes setup), then this SNI is required to successfully route the request (illustrated below).
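As a hedged illustration of the SNI point above (the hostname and port are hypothetical, not part of this PR): Python's `ssl` module only places an SNI value in the TLS handshake when it is given a hostname via `server_hostname`, so connecting to a raw IP leaves a TLS router with nothing to route on.

```python
# Illustration only: an SNI value is sent when wrap_socket() receives a
# hostname; connecting by raw IP would leave the SNI extension empty.
# "worker-1.example.org" is a hypothetical worker address.
import socket
import ssl

ctx = ssl.create_default_context()
with socket.create_connection(("worker-1.example.org", 443)) as sock:
    # server_hostname is placed into the ClientHello's SNI extension and is
    # also used for certificate hostname verification.
    with ctx.wrap_socket(sock, server_hostname="worker-1.example.org") as tls:
        print(tls.version())
```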
Closes #4070
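For reference, a minimal sketch of how the option might be toggled, assuming (per the discussion further down in this thread) that the config key is `distributed.scheduler.resolve-worker-hostname` and that `True` keeps the existing resolve-to-IP behavior while `False` preserves hostnames; the key name and semantics may differ in the merged version.

```python
# Sketch, not the authoritative API: key name and meaning are taken from
# this thread and may differ in the merged version.
import dask
from dask.distributed import Client

# False -> preserve worker hostnames instead of resolving them to IPs.
dask.config.set({"distributed.scheduler.resolve-worker-hostname": False})

client = Client()  # a scheduler started after this point sees the override
```

The same value could instead be set centrally by nesting the key under the `distributed.scheduler` section of `distributed.yaml`.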
@oshadura The build is failing due to some linting issue, in particular a formatting issue since we're using black for automatic code formatting. I can recommend installing pre-commit for this.
I like this change from a debugging POV alone, since hostnames are easier to read, but I don't have a good feeling about what the default should be.
From the description of the issue, I have the impression that, functionality-wise, off by default is sensible, but that comes with certain performance penalties which may not be obvious to users. However, with DNS caches I'm not sure how severe these penalties actually are (see the timing sketch below).
@jacobtomlinson you've been looking into various deployment scenarios recently. Do you have an idea what the proper choice should be here?
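One way to get a feel for the penalty being discussed (purely a measurement sketch; the hostname is a placeholder, not something from this PR) is to time repeated `socket.getaddrinfo` calls, since a warm resolver cache typically makes lookups after the first one much cheaper:

```python
# Rough measurement of DNS resolution cost; "some-worker.example.org" is a
# placeholder hostname, substitute one from your own cluster.
import socket
import time

host = "some-worker.example.org"
for attempt in range(3):
    start = time.perf_counter()
    try:
        socket.getaddrinfo(host, 8786)
    except socket.gaierror as exc:
        print(f"lookup failed: {exc}")
        break
    elapsed = time.perf_counter() - start
    print(f"lookup {attempt + 1}: {elapsed * 1000:.2f} ms")
```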
I think that when we looked into this before it was ok/preferable to store hostnames rather than IP addresses. I'm surprised that that was not implemented. I agree that Jacob would be a good person to ask here.
This seems a reasonable change to me. I would be curious to know the original intention behind resolving IPs upfront, though. It is an unusual thing to do. My guess is that this was some requirement for some HPC system in the early days of Dask? @mrocklin do you know the reasoning?
Originally I think I may have run into systems where DNS lookups took a while. My understanding today is that this is rare.
I thought that we had already fixed this. Fixing it sounds ok to me.
I retriggered the tests to see if they are merely flaky or if there is something off. It appears we all agree, so we can merge once this is green-ish.
It looks like some of the failures are genuine. Lots of red X's here. Looking briefly at one report, I see that this test fails:
FAILED distributed/tests/test_scheduler.py::test_coerce_address - assert 'tcp...
Which seems like it might be related. Possibly the test is now wrong.
Thanks @mrocklin, I checked locally and you are right! I will push a fix.
@mrocklin could I please ask for a suggestion about what I am doing wrong? With `distributed.scheduler.resolve-worker-hostname: True` the failing assert should be fine, but even with the next line (https://github.com/dask/distributed/pull/4938/files#diff-d4f4e24d830a24ab478c8cdce0a8da2357f57d7dbc08a7181c79328db65dbdffR598) it still fails locally for me. If I instead switch it centrally in distributed.yaml, then it works and the test passes.
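If it helps pin down the difference between the two approaches: a per-test override is usually passed through `gen_cluster(config=...)` (or a `dask.config.set` context entered before the cluster starts), because values read from `distributed.yaml` are fixed when the scheduler is created. A sketch, assuming the config key name and test behavior discussed in this thread:

```python
# Sketch of a per-test override; the config key and its effect on
# test_coerce_address are assumptions based on this thread.
from distributed.utils_test import gen_cluster


@gen_cluster(config={"distributed.scheduler.resolve-worker-hostname": True})
async def test_coerce_address_resolves_to_ip(s, a, b):
    # With resolution re-enabled, coerced addresses should be IP-based again,
    # so the original assertions in test_coerce_address should still hold.
    assert s.coerce_address("127.0.0.1:8000") == "tcp://127.0.0.1:8000"
```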
Wondering what the status of this PR is? It seemed very close and then went by the wayside.
Can one of the admins verify this patch?