distributed icon indicating copy to clipboard operation
distributed copied to clipboard

Add scheduler-sni option for dask workers

Open holzman opened this issue 2 years ago • 15 comments

This adds support for dask workers to specify the SNI host separately from the TLS endpoint. In some environments (such as a Kubernetes cluster running dask-gateway), the schedulers may be behind a TLS proxy, while the workers are external to the cluster. This allows workers to connect to the schedulers without the need for wildcard DNS.

Signed-off-by: Burt Holzman [email protected]

Closes #6289

  • [x] Tests added / passed
  • [x] Passes pre-commit run --all-files

holzman avatar May 05 '22 20:05 holzman

Can one of the admins verify this patch?

GPUtester avatar May 05 '22 20:05 GPUtester

Unit Test Results

       16 files  +       4         16 suites  +4   7h 16m 54s :stopwatch: + 2h 5m 5s   2 768 tests +       6    2 690 :heavy_check_mark: +     20       78 :zzz:  -   13  0 :x:  - 1  22 106 runs  +5 560  21 087 :heavy_check_mark: +5 341  1 019 :zzz: +225  0 :x:  - 6 

Results for commit 9da242fc. ± Comparison against base commit 70e5c908.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar May 06 '22 00:05 github-actions[bot]

I realized I left off the websockets implementation; I'll squash my commit on that and force-push momentarily.

holzman avatar May 06 '22 22:05 holzman

Thank you for the PR @holzman . I apologize for the delay in a response. @jacobtomlinson is this something that you or someone around you can review?

mrocklin avatar May 13 '22 19:05 mrocklin

It's been a few weeks - before I rebase against HEAD, any issues with merging this?

holzman avatar Jun 02 '22 14:06 holzman

I think we were waiting on @jcrist for input but we are probably fine to move ahead. This seems fine and if it is inspired by dask-gateway then it's a reasonable change.

Please feel free to merge in from main and resolve conflicts, we squash on merge anyway so don't stress about rebasing, squashing, force pushing, etc. It actually makes review harder when folks do that as we can't see the history of the PR.

jacobtomlinson avatar Jun 14 '22 10:06 jacobtomlinson

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±  0         15 suites  ±0   6h 51m 44s :stopwatch: + 21m 4s   3 006 tests +  6    2 915 :heavy_check_mark: +  4       89 :zzz: ±0  2 :x: +2  22 288 runs  +42  21 234 :heavy_check_mark: +39  1 052 :zzz: +1  2 :x: +2 

For more details on these failures, see this check.

Results for commit 47d6dd62. ± Comparison against base commit 969aa463.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Jun 14 '22 16:06 github-actions[bot]

Hi @jacobtomlinson: it's been a few more weeks. Please let me know if there's anything else I need to do to help keep this moving.

holzman avatar Jul 07 '22 19:07 holzman

@holzman apologies on the delay here. Maintainers here have been trying to dig out from some unpleasant deadlocking with the scheduler recently. I'm apprehensive to merge this change without input from the folks with deeper experience looking into that. @crusaderky and @florian-jetter-by are the best folks to merge this but are both out currently.

Perhaps @graingert or @gjoseph92 could also take a look and give a review here?

jacobtomlinson avatar Jul 14 '22 15:07 jacobtomlinson

It's been another month since any activity on this. As always let me know if there's anything I can do to help move this along.

holzman avatar Aug 15 '22 14:08 holzman

It's been another month since any activity on this. As always let me know if there's anything I can do to help move this along.

Ah sorry for dropping the ball on this

graingert avatar Aug 15 '22 14:08 graingert

@graingert - did my comment https://github.com/dask/distributed/pull/6290#discussion_r946004112 make sense? Depending on the configuration, the communication may not be using TLS, so we can't call start_tls for every invocation, so we need two code paths here.

holzman avatar Aug 17 '22 19:08 holzman

It's been another three weeks - checking in on this PR yet again.

holzman avatar Sep 07 '22 17:09 holzman

Bump - it's been radio silence since mid-August. Please let me know what I can do to help move this forward.

holzman avatar Sep 30 '22 17:09 holzman

@gjoseph92 or @crusaderky -- Can one of you pick this up?

hayesgb avatar Sep 30 '22 19:09 hayesgb