dask-kubernetes

Dask worker is reset to use TCP even if it was configured to use TLS in the YAML file

Open weiwang217 opened this issue 2 years ago • 8 comments

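Describe the issue: The Dask operator resets the scheduler address to TCP even when the worker is configured to use TLS. The worker's environment ends up with the configured TLS address followed by an appended TCP address: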

    - name: DASK_SCHEDULER_ADDRESS
      value: tls://scheduler.join.svc.cluster.local:8786
    - name: DASK_TEMPORARY_DIRECTORY
      value: /tmp
    - name: DASK_WORKER_NAME
      value: default-worker-6a9c9e4f94
    - name: DASK_SCHEDULER_ADDRESS
      value: tcp://scheduler.join.svc.cluster.local:8786

The code to append the config is here: https://github.com/dask/dask-kubernetes/blob/fa7255b95e686025fd818f718b83635ff9424769/dask_kubernetes/operator/controller/controller.py#L156
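
One possible way to avoid the duplicate entry, sketched under assumptions: the helper name `add_default_env` and the dict layout are hypothetical, and this is not necessarily what the operator or any fix for it does. The idea is to append an operator default only when the user has not already set a variable with that name.

```python
def add_default_env(container, defaults):
    """Append default env entries only for names not already present.

    `container` is a plain dict shaped like a Kubernetes container spec.
    This is a hypothetical helper, not part of dask-kubernetes.
    """
    env = container.setdefault("env", [])
    existing = {item["name"] for item in env}
    for item in defaults:
        if item["name"] not in existing:
            # Copy the entry so the defaults list is never shared or mutated.
            env.append(dict(item))
    return container


container = {
    "name": "worker",
    "env": [
        {
            "name": "DASK_SCHEDULER_ADDRESS",
            "value": "tls://scheduler.join.svc.cluster.local:8786",
        }
    ],
}
defaults = [
    {"name": "DASK_WORKER_NAME", "value": "default-worker-6a9c9e4f94"},
    {
        "name": "DASK_SCHEDULER_ADDRESS",
        "value": "tcp://scheduler.join.svc.cluster.local:8786",
    },
]

add_default_env(container, defaults)
for item in container["env"]:
    print(item["name"], "=", item["value"])
```

With this approach the user's tls:// address is kept and only DASK_WORKER_NAME is added.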

Minimal Complete Verifiable Example:

# Put your MCVE code here

Anything else we need to know?:

Environment:

  • Dask version:
  • Python version:
  • Operating System:
  • Install method (conda, pip, source):

weiwang217, Oct 18 '23 21:10

Thanks for raising this @weiwang217. I've opened #837 to resolve this. Would you mind testing that PR out and letting me know if it solves your problem?

jacobtomlinson, Oct 19 '23 10:10

Thanks! How can I build and install the dask operator?

weiwang217, Oct 19 '23 17:10

We have documentation on how to do this here https://kubernetes.dask.org/en/latest/testing.html#testing-operator-controller-prs

jacobtomlinson, Oct 20 '23 09:10

It worked. When will the code be merged into main? Thanks!

Thanks, Wei

weiwang217, Oct 20 '23 23:10

Hi Jacob,

I suspect that change may have caused a regression when working with replicas > 1. When I start a new DaskJob, all but one replica fail to connect to the scheduler because of duplicate names. Indeed, when I run kubectl describe pod <worker_pod>, I see:

Worker 1:

    Environment:
      DASK_WORKER_NAME:        simple-job-default-worker-a10a25ac26
      DASK_SCHEDULER_ADDRESS:  tcp://simple-job-scheduler.join.svc.cluster.local:8786
      ...

Worker 2:

    Environment:
      DASK_WORKER_NAME:        simple-job-default-worker-00add84cde
      DASK_SCHEDULER_ADDRESS:  tcp://simple-job-scheduler.join.svc.cluster.local:8786
      DASK_WORKER_NAME:        simple-job-default-worker-a10a25ac26
      DASK_SCHEDULER_ADDRESS:  tcp://simple-job-scheduler.join.svc.cluster.local:8786

Because the last DASK_WORKER_NAME defined on each pod is the first replica's, all replicas end up sharing the same name.

Do you mind taking a look?

(Context: I'm on the same team as weiwang217 and we just noticed this change recently)

kjleftin, Oct 26 '23 23:10

Thanks for reporting this @kjleftin. Why are you setting the DASK_WORKER_NAME in your config?

jacobtomlinson, Oct 31 '23 16:10

Hi Jacob,

I'm following the example code in https://kubernetes.dask.org/en/latest/operator_resources.html#daskjob

Specifically, I'm passing the DASK_WORKER_NAME environment variable to the dask worker CLI:

            - name: worker
              image: "ghcr.io/dask/dask:latest"
              imagePullPolicy: "IfNotPresent"
              args:
                - dask-worker
                - --name
                - $(DASK_WORKER_NAME)
                - --dashboard
                - --dashboard-address
                - "8788"

Note that I'm not setting DASK_WORKER_NAME explicitly. That is handled by the Dask Operator. (Before this change, each worker would have a different value for DASK_WORKER_NAME, but after this change, each worker has the same value).

kjleftin, Oct 31 '23 22:10

@kjleftin ok thanks for the clarification. I expect we may need to use copy to avoid this. I'll take a look at the PR and update it.
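
A minimal sketch of the kind of shared-state bug that a copy would avoid. The function `build_worker_spec`, the helper `worker_names`, and the dict layout are hypothetical and are not the operator's actual code; the sketch only assumes that every replica is built from the same spec object and that the operator's variables are placed before any already present.

```python
import copy


def build_worker_spec(shared_spec, worker_name, scheduler_address, use_copy):
    """Build one worker's spec from the spec shared by all replicas.

    Hypothetical illustration only, not the dask-kubernetes implementation.
    """
    # Without a copy, the assignment below mutates the shared spec itself.
    spec = copy.deepcopy(shared_spec) if use_copy else shared_spec
    operator_env = [
        {"name": "DASK_WORKER_NAME", "value": worker_name},
        {"name": "DASK_SCHEDULER_ADDRESS", "value": scheduler_address},
    ]
    for container in spec["containers"]:
        # Operator values first, then whatever env the container already has.
        container["env"] = operator_env + container.get("env", [])
    return spec


def worker_names(spec):
    """Return every DASK_WORKER_NAME value on the first container, in order."""
    env = spec["containers"][0]["env"]
    return [e["value"] for e in env if e["name"] == "DASK_WORKER_NAME"]


address = "tcp://simple-job-scheduler.join.svc.cluster.local:8786"

# No copy: the first replica's env leaks into the second replica's spec.
shared = {"containers": [{"name": "worker"}]}
build_worker_spec(shared, "worker-a", address, use_copy=False)
leaked = worker_names(build_worker_spec(shared, "worker-b", address, use_copy=False))

# With a deep copy: each replica starts from the untouched shared spec.
shared = {"containers": [{"name": "worker"}]}
build_worker_spec(shared, "worker-a", address, use_copy=True)
clean = worker_names(build_worker_spec(shared, "worker-b", address, use_copy=True))

print(leaked)  # ['worker-b', 'worker-a']
print(clean)   # ['worker-b']
```

In the no-copy case the second replica carries the first replica's name as its last DASK_WORKER_NAME entry, which matches the shape of the kubectl describe output reported above.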

jacobtomlinson, Nov 01 '23 17:11