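Dask worker is reset to use TCP even when it is configured to use TLS in the YAML file
Describe the issue: The Dask operator resets the worker's scheduler address to TCP even when the worker is configured to use TLS. The resulting environment contains the configured `tls://` address followed by an appended `tcp://` address: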
- name: DASK_SCHEDULER_ADDRESS
  value: tls://scheduler.join.svc.cluster.local:8786
- name: DASK_TEMPORARY_DIRECTORY
  value: /tmp
- name: DASK_WORKER_NAME
  value: default-worker-6a9c9e4f94
- name: DASK_SCHEDULER_ADDRESS
  value: tcp://scheduler.join.svc.cluster.local:8786
The code to append the config is here: https://github.com/dask/dask-kubernetes/blob/fa7255b95e686025fd818f718b83635ff9424769/dask_kubernetes/operator/controller/controller.py#L156
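One way to avoid overriding a user-supplied address is to append operator defaults only for names that are not already present. The following is a minimal sketch of that idea, not the code from the linked controller or from any PR; the helper name `append_missing_env` and the plain-dict representation of env entries are assumptions made for illustration.

```python
from typing import Dict, List

EnvVar = Dict[str, str]


def append_missing_env(env: List[EnvVar], defaults: List[EnvVar]) -> List[EnvVar]:
    """Return a new env list with defaults added only for names not already set.

    Hypothetical helper for illustration; not part of dask-kubernetes.
    """
    existing = {item["name"] for item in env}
    merged = list(env)  # new list, so the caller's list is not mutated
    for item in defaults:
        if item["name"] not in existing:
            merged.append(dict(item))
            existing.add(item["name"])
    return merged


# Values taken from the environment shown in the report.
user_env = [
    {"name": "DASK_SCHEDULER_ADDRESS",
     "value": "tls://scheduler.join.svc.cluster.local:8786"},
    {"name": "DASK_TEMPORARY_DIRECTORY", "value": "/tmp"},
]
operator_defaults = [
    {"name": "DASK_WORKER_NAME", "value": "default-worker-6a9c9e4f94"},
    {"name": "DASK_SCHEDULER_ADDRESS",
     "value": "tcp://scheduler.join.svc.cluster.local:8786"},
]

merged_env = append_missing_env(user_env, operator_defaults)
addresses = [e["value"] for e in merged_env if e["name"] == "DASK_SCHEDULER_ADDRESS"]
```

With this approach the `tls://` address set in the YAML file is the only `DASK_SCHEDULER_ADDRESS` entry, while `DASK_WORKER_NAME` is still added because the user did not set it.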
Thanks for raising this @weiwang217. I've opened #837 to resolve this. Would you mind testing that PR out and letting me know if it solves your problem?
Thanks! How can I build and install the dask operator?
We have documentation on how to do this here: https://kubernetes.dask.org/en/latest/testing.html#testing-operator-controller-prs
It worked. When will the change be merged into main? Thanks!
Thanks, Wei
Hi Jacob,
I suspect that this change may have caused a regression when working with replicas > 1. When I start a new DaskJob, all but one of the replicas fail to connect to the scheduler because of duplicate worker names. Indeed, when I run
kubectl describe pod <worker_pod>
I see:
Worker 1:
Environment:
DASK_WORKER_NAME: simple-job-default-worker-a10a25ac26
DASK_SCHEDULER_ADDRESS: tcp://simple-job-scheduler.join.svc.cluster.local:8786
...
Worker 2:
Environment:
DASK_WORKER_NAME: simple-job-default-worker-00add84cde
DASK_SCHEDULER_ADDRESS: tcp://simple-job-scheduler.join.svc.cluster.local:8786
DASK_WORKER_NAME: simple-job-default-worker-a10a25ac26
DASK_SCHEDULER_ADDRESS: tcp://simple-job-scheduler.join.svc.cluster.local:8786
On the second replica, each variable is defined twice, and the last definition is the value from the first replica. Because the last definition takes effect, all replicas end up with the same name.
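The effect can be checked with a small sketch that resolves an env list the way the report describes, with a later entry overriding an earlier one of the same name. The function `effective_env` is a hypothetical helper, and the lists contain only the entries visible in the `kubectl describe` output above (the elided entries are left out).

```python
from typing import Dict, List


def effective_env(env: List[Dict[str, str]]) -> Dict[str, str]:
    """Collapse an env list into name -> value; later entries override earlier ones."""
    resolved: Dict[str, str] = {}
    for item in env:
        resolved[item["name"]] = item["value"]
    return resolved


SCHEDULER = "tcp://simple-job-scheduler.join.svc.cluster.local:8786"

worker_1_env = [
    {"name": "DASK_WORKER_NAME", "value": "simple-job-default-worker-a10a25ac26"},
    {"name": "DASK_SCHEDULER_ADDRESS", "value": SCHEDULER},
]
worker_2_env = [
    {"name": "DASK_WORKER_NAME", "value": "simple-job-default-worker-00add84cde"},
    {"name": "DASK_SCHEDULER_ADDRESS", "value": SCHEDULER},
    # Duplicated entries carrying the first replica's values.
    {"name": "DASK_WORKER_NAME", "value": "simple-job-default-worker-a10a25ac26"},
    {"name": "DASK_SCHEDULER_ADDRESS", "value": SCHEDULER},
]

name_1 = effective_env(worker_1_env)["DASK_WORKER_NAME"]
name_2 = effective_env(worker_2_env)["DASK_WORKER_NAME"]
print(name_1 == name_2)  # both workers resolve to the same name
```

Both workers resolve to `simple-job-default-worker-a10a25ac26`, which matches the duplicate-name failure described above.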
Do you mind taking a look?
(Context: I'm on the same team as weiwang217 and we just noticed this change recently)
Thanks for reporting this @kjleftin. Why are you setting the DASK_WORKER_NAME in your config?
Hi Jacob,
I'm following the example code in https://kubernetes.dask.org/en/latest/operator_resources.html#daskjob
Specifically, passing the DASK_WORKER_NAME environment variable to the dask worker CLI:
- name: worker
  image: "ghcr.io/dask/dask:latest"
  imagePullPolicy: "IfNotPresent"
  args:
    - dask-worker
    - --name
    - $(DASK_WORKER_NAME)
    - --dashboard
    - --dashboard-address
    - "8788"
Note that I'm not setting DASK_WORKER_NAME explicitly; that is handled by the Dask operator. Before this change, each worker had a different value for DASK_WORKER_NAME. After this change, every worker has the same value.
@kjleftin ok thanks for the clarification. I expect we may need to use copy to avoid this. I'll take a look at the PR and update it.
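For readers following along, here is a minimal sketch of the general hazard that copying avoids: if every replica's pod spec is built by mutating one shared template, env entries added for one replica leak into the next. This is an illustration under assumptions, not the operator's actual code; the function names, the template layout, and the short worker names are hypothetical, and the order of the leaked entries may differ from the output reported above.

```python
import copy
from typing import Any, Dict, List

Spec = Dict[str, Any]


def make_template() -> Spec:
    """Hypothetical worker pod template with one container and an empty env list."""
    return {"containers": [{"name": "worker", "env": []}]}


def build_spec_shared(template: Spec, worker_name: str) -> Spec:
    """Buggy variant: appends to the template's own env list in place."""
    template["containers"][0]["env"].append(
        {"name": "DASK_WORKER_NAME", "value": worker_name}
    )
    return template


def build_spec_copied(template: Spec, worker_name: str) -> Spec:
    """Safer variant: deep-copies the template before adding per-worker env."""
    spec = copy.deepcopy(template)
    spec["containers"][0]["env"].append(
        {"name": "DASK_WORKER_NAME", "value": worker_name}
    )
    return spec


def worker_names(spec: Spec) -> List[str]:
    """List every DASK_WORKER_NAME value present in the first container."""
    return [e["value"] for e in spec["containers"][0]["env"]
            if e["name"] == "DASK_WORKER_NAME"]


# Shared template: the second replica also carries the first replica's name.
shared = make_template()
build_spec_shared(shared, "worker-a")
leaked = worker_names(build_spec_shared(shared, "worker-b"))

# Deep copy per replica: each spec carries only its own name.
clean = make_template()
names_a = worker_names(build_spec_copied(clean, "worker-a"))
names_b = worker_names(build_spec_copied(clean, "worker-b"))
```

A deep copy is used rather than a shallow one because the env list is nested inside the container dict; a shallow copy of the outer dict would still share that inner list.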