Problem when installing adaptdl scheduler
Hi, I am trying to install the scheduler with Helm:
sudo helm install adaptdl adaptdl-sched --repo https://github.com/petuum/adaptdl/raw/helm-repo --namespace default --set docker-registry.enabled=true
However, the components defined in the chart templates do not seem to be running. I ran ps aux | grep python, but there is no "adaptdl_sched.allocator", "adaptdl_sched.supervisor", or "adaptdl_sched.validator" process. The scheduler's services seem to be okay:
$ kubectl describe services
Name: adaptdl-adaptdl-sched
Namespace: default
Labels: app=adaptdl-sched
app.kubernetes.io/managed-by=Helm
release=adaptdl
Annotations: meta.helm.sh/release-name: adaptdl
meta.helm.sh/release-namespace: default
Selector: app=adaptdl-sched,release=adaptdl
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.104.151.75
IPs: 10.104.151.75
Port: http 9091/TCP
TargetPort: 9091/TCP
Endpoints: <none>
Session Affinity: None
Events: <none>
Name: adaptdl-registry
Namespace: default
Labels: app=docker-registry
app.kubernetes.io/managed-by=Helm
chart=docker-registry-1.9.4
heritage=Helm
release=adaptdl
Annotations: meta.helm.sh/release-name: adaptdl
meta.helm.sh/release-namespace: default
Selector: app=docker-registry,release=adaptdl
Type: NodePort
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.103.68.10
IPs: 10.103.68.10
Port: registry 5000/TCP
TargetPort: 5000/TCP
NodePort: registry 32000/TCP
Endpoints: <none>
Session Affinity: None
External Traffic Policy: Cluster
Events: <none>
Name: adaptdl-supervisor
Namespace: default
Labels: app=adaptdl-sched
app.kubernetes.io/managed-by=Helm
release=adaptdl
Annotations: meta.helm.sh/release-name: adaptdl
meta.helm.sh/release-namespace: default
Selector: app=adaptdl-sched,release=adaptdl
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.111.1.168
IPs: 10.111.1.168
Port: http 8080/TCP
TargetPort: 8080/TCP
Endpoints: <none>
Session Affinity: None
Events: <none>
Name: adaptdl-validator
Namespace: default
Labels: app=adaptdl-validator
app.kubernetes.io/managed-by=Helm
release=adaptdl
Annotations: meta.helm.sh/release-name: adaptdl
meta.helm.sh/release-namespace: default
Selector: app=adaptdl-validator,release=adaptdl
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.102.34.180
IPs: 10.102.34.180
Port: https 443/TCP
TargetPort: https/TCP
Endpoints: <none>
Session Affinity: None
Events: <none>
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.96.0.1
IPs: 10.96.0.1
Port: https 443/TCP
TargetPort: 6443/TCP
Endpoints: 10.5.0.4:6443
Session Affinity: None
Events: <none>
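For completeness, the scheduler containers can also be inspected directly through kubectl rather than with ps aux on a host; a minimal sketch, using the labels and names from the output above:
# Scheduler pod and its container logs (labels taken from the service above).
kubectl get pods -l app=adaptdl-sched -o wide
kubectl logs deployment/adaptdl-adaptdl-sched --all-containers=true --tail=20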
When I try to run an AdaptDL job, I always get this error:
File "run_workload.py", line 136, in <module>
objs_api.create_namespaced_custom_object(*obj_args, job)
File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 225, in create_namespaced_custom_object
return self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs) # noqa: E501
File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 344, in create_namespaced_custom_object_with_http_info
return self.api_client.call_api(
File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 391, in request
return self.rest_client.POST(url,
File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/rest.py", line 274, in POST
return self.request("POST", url,
File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '01c68c3b-393c-419a-9b81-d3393b80d47f', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'abb23a1c-5ca0-4b97-a67f-58f65e44bf9d', 'Date': 'Wed, 15 Jun 2022 14:49:52 GMT', 'Content-Length': '521'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"adaptdl-validator.default.svc.cluster.local\": Post \"https://adaptdl-validator.default.svc:443/validate?timeout=10s\": context deadline exceeded","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"adaptdl-validator.default.svc.cluster.local\": Post \"https://adaptdl-validator.default.svc:443/validate?timeout=10s\": context deadline exceeded"}]},"code":500}
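For reference, the webhook named in the error can be inspected with standard kubectl commands like these (resource names taken from the output above):
# Registered validating webhook, the validator Service, and the validator's logs.
kubectl get validatingwebhookconfigurations
kubectl describe service adaptdl-validator
kubectl logs deployment/adaptdl-validator --tail=50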
I followed these commands to set up the environment:
CNI_VERSION="v0.8.2"
ARCH="amd64"
sudo mkdir -p /opt/cni/bin
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-${ARCH}-${CNI_VERSION}.tgz" | sudo tar -C /opt/cni/bin -xz
DOWNLOAD_DIR=/usr/local/bin
sudo mkdir -p $DOWNLOAD_DIR
CRICTL_VERSION="v1.22.0"
ARCH="amd64"
curl -L "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-${ARCH}.tar.gz" | sudo tar -C $DOWNLOAD_DIR -xz
RELEASE="v1.21.0"
ARCH="amd64"
cd $DOWNLOAD_DIR
sudo curl -L --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/${ARCH}/{kubeadm,kubelet,kubectl}
sudo chmod +x {kubeadm,kubelet,kubectl}
RELEASE_VERSION="v0.4.0"
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubelet/lib/systemd/system/kubelet.service" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service
sudo mkdir -p /etc/systemd/system/kubelet.service.d
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubeadm/10-kubeadm.conf" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
systemctl enable --now kubelet
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh -b -p ${HOME}/software/miniconda3
echo "export PATH=${HOME}/software/miniconda3/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
sudo apt install conntrack
sudo snap install yq --channel=v3/stable
sudo kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p ~/.kube
sudo cp /etc/kubernetes/admin.conf ~/.kube/config
sudo chown -f -R $USER ~/.kube
kubectl apply -f https://docs.projectcalico.org/v3.11/manifests/calico.yaml
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta5/nvidia-device-plugin.yml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/common.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/operator.yaml
curl -s https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/cluster.yaml | /snap/bin/yq w - spec.storage.deviceFilter nvme0n1p2 | kubectl apply -f -
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/filesystem.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/csi/cephfs/storageclass.yaml
docker login -u ${var.docker_username} -p '${var.docker_password}'
kubectl create secret generic regcred --from-file=.dockerconfigjson=/home/ubuntu/.docker/config.json --type=kubernetes.io/dockerconfigjson
helm repo add stable https://charts.helm.sh/stable --force-update
conda env update -f ~/adaptdl/benchmark/environment.yaml # path
#install helm (https://helm.sh/docs/intro/install/)
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
# install scheduler
sudo helm install adaptdl adaptdl-sched --repo https://github.com/petuum/adaptdl/raw/helm-repo --namespace default --set docker-registry.enabled=true
What is really strange is that everything seems to be fine:
NAME READY STATUS RESTARTS AGE
pod/adaptdl-adaptdl-sched-cbc794b8f-8xq2f 3/3 Running 0 36m
pod/adaptdl-registry-76d9c8b759-tqrdv 1/1 Running 0 36m
pod/adaptdl-validator-d878bc9c9-ddglc 1/1 Running 0 36m
pod/images-6jsfj 6/6 Running 0 103m
pod/images-gldv7 6/6 Running 0 103m
pod/images-lprhh 6/6 Running 0 103m
pod/images-qsglk 6/6 Running 0 103m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/adaptdl-adaptdl-sched ClusterIP 10.102.52.25 <none> 9091/TCP 36m
service/adaptdl-registry NodePort 10.100.70.211 <none> 5000:32000/TCP 36m
service/adaptdl-supervisor ClusterIP 10.111.108.197 <none> 8080/TCP 36m
service/adaptdl-validator ClusterIP 10.98.19.54 <none> 443/TCP 36m
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 7h6m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/images 4 4 4 4 4 <none> 103m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/adaptdl-adaptdl-sched 1/1 1 1 36m
deployment.apps/adaptdl-registry 1/1 1 1 36m
deployment.apps/adaptdl-validator 1/1 1 1 36m
NAME DESIRED CURRENT READY AGE
replicaset.apps/adaptdl-adaptdl-sched-cbc794b8f 1 1 1 36m
replicaset.apps/adaptdl-registry-76d9c8b759 1 1 1 36m
replicaset.apps/adaptdl-validator-d878bc9c9 1 1 1 36m
Do you know why this would happen? Thank you!
Could you provide the output of helm list and kubectl get all?
Sure
$ helm list
NAME NAMESPACE REVISION UPDATED STATUS CHART APP VERSION
adaptdl default 1 2022-06-15 14:43:38.613906983 +0000 UTC deployed adaptdl-sched-0.2.11 0.2.11
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/adaptdl-adaptdl-sched-cbc794b8f-8xq2f 3/3 Running 0 45m
pod/adaptdl-registry-76d9c8b759-tqrdv 1/1 Running 0 45m
pod/adaptdl-validator-d878bc9c9-ddglc 1/1 Running 0 45m
pod/images-6jsfj 6/6 Running 0 112m
pod/images-gldv7 6/6 Running 0 112m
pod/images-lprhh 6/6 Running 0 112m
pod/images-qsglk 6/6 Running 0 112m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/adaptdl-adaptdl-sched ClusterIP 10.102.52.25 <none> 9091/TCP 45m
service/adaptdl-registry NodePort 10.100.70.211 <none> 5000:32000/TCP 45m
service/adaptdl-supervisor ClusterIP 10.111.108.197 <none> 8080/TCP 45m
service/adaptdl-validator ClusterIP 10.98.19.54 <none> 443/TCP 45m
service/kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 7h15m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/images 4 4 4 4 4 <none> 112m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/adaptdl-adaptdl-sched 1/1 1 1 45m
deployment.apps/adaptdl-registry 1/1 1 1 45m
deployment.apps/adaptdl-validator 1/1 1 1 45m
NAME DESIRED CURRENT READY AGE
replicaset.apps/adaptdl-adaptdl-sched-cbc794b8f 1 1 1 45m
replicaset.apps/adaptdl-registry-76d9c8b759 1 1 1 45m
replicaset.apps/adaptdl-validator-d878bc9c9 1 1 1 45m
Thank you for your quick reply!
Besides, I found that if I set the repository to the default localhost:32000/pollux, there is this error:
Get "http://localhost:32000/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Currently I am using Docker Hub as a workaround. I don't know if this error is related to the one above, because they both seem to be related to the installation of the scheduler. I would really appreciate any help. Thanks!
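For reference, a quick way to probe the registry endpoint from the error on a node (a healthy registry should answer /v2/ with HTTP 200 and an empty JSON body):
# Check the NodePort registry Service and hit the same URL the error mentions.
kubectl get svc adaptdl-registry
curl -v http://localhost:32000/v2/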
It seems a bit like certain pairs (or all pairs?) of Kubernetes pods are not able to reach each other. Could you try:
- kubectl exec into a pod and try to ping a different pod.
- Check the health of the calico plugin (in the kube-system namespace); see the sketch below.
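For the second point, a quick health check might look like this (labels assumed from the standard calico manifest):
# Calico DaemonSet and pods in kube-system; everything should be Running/Ready.
kubectl get daemonset calico-node -n kube-system
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
kubectl get pods -n kube-system -l k8s-app=calico-kube-controllers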
Thank you for your suggestions ❤️ As for the first suggestion, unfortunately, the only pods I have are the scheduler pods and the image pods, and neither can run ping or install it. I am sure that the servers can reach each other. Do you have any other way to test whether the pods can reach each other?
For the second suggestion, I checked the pods, deployments, and daemonsets in the kube-system namespace and see no errors.
$ kubectl get all -n kube-system
NAME READY STATUS RESTARTS AGE
pod/calico-kube-controllers-5bcd7db644-6z8f6 1/1 Running 0 11m
pod/calico-node-hp4p6 1/1 Running 0 6m7s
pod/calico-node-j5r62 1/1 Running 0 6m1s
pod/calico-node-tqx9p 1/1 Running 0 6m3s
pod/calico-node-vp84l 1/1 Running 0 11m
pod/coredns-558bd4d5db-f9r56 1/1 Running 0 12m
pod/coredns-558bd4d5db-wr5wm 1/1 Running 0 12m
pod/etcd-superbench-dev-00000g 1/1 Running 0 12m
pod/kube-apiserver-superbench-dev-00000g 1/1 Running 0 12m
pod/kube-controller-manager-superbench-dev-00000g 1/1 Running 0 12m
pod/kube-proxy-cj9gh 1/1 Running 0 12m
pod/kube-proxy-l4dxb 1/1 Running 0 6m1s
pod/kube-proxy-pmg2j 1/1 Running 0 6m3s
pod/kube-proxy-sc7cl 1/1 Running 0 6m7s
pod/kube-scheduler-superbench-dev-00000g 1/1 Running 0 12m
pod/nvidia-device-plugin-daemonset-4fhbv 1/1 Running 0 5m53s
pod/nvidia-device-plugin-daemonset-s2vgb 1/1 Running 0 5m51s
pod/nvidia-device-plugin-daemonset-shq25 1/1 Running 0 5m57s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 12m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/calico-node 4 4 4 4 4 beta.kubernetes.io/os=linux 11m
daemonset.apps/kube-proxy 4 4 4 4 4 kubernetes.io/os=linux 12m
daemonset.apps/nvidia-device-plugin-daemonset 3 3 3 3 3 <none> 11m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/calico-kube-controllers 1/1 1 1 11m
deployment.apps/coredns 2/2 2 2 12m
NAME DESIRED CURRENT READY AGE
replicaset.apps/calico-kube-controllers-5bcd7db644 1 1 1 11m
replicaset.apps/coredns-558bd4d5db 2 2 2 12m
Thanks again for your help!
You'll want to create a pod running a container that can run ping.
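A minimal sketch (busybox includes ping; the pod names and the IP placeholder are just examples):
# Two throwaway pods, ideally landing on different nodes.
kubectl run nettest-a --image=busybox --restart=Never -- sleep 3600
kubectl run nettest-b --image=busybox --restart=Never -- sleep 3600
kubectl get pods -o wide          # note each pod's IP and node
kubectl exec -it nettest-a -- ping -c 3 <IP-of-nettest-b>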
Thank you for your suggestion! It does seem that there is a networking problem. I found that the reason I cannot run ping is that apt-get cannot install it in any pod:
root@hello-world:~/# apt-get install inetutils-ping
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package inetutils-ping
root@hello-world:~/# ifconfig
bash: ifconfig: command not found
root@hello-world:~/# apt-get install net-tools
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package net-tools
root@hello-world:~/# apt-get update
Err:1 http://archive.ubuntu.com/ubuntu focal InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://security.ubuntu.com/ubuntu focal-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Temporary failure resolving 'archive.ubuntu.com'
Reading package lists... Done
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-backports/InRelease Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/focal-security/InRelease Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
I tried building a container with plain docker, and apt-get runs successfully inside those containers, so there must be something wrong with the Kubernetes networking. However, I found no errors in the coredns or calico-related pods. I am quite sure that I ran sudo kubeadm init --pod-network-cidr=192.168.0.0/16 and kubectl apply -f https://docs.projectcalico.org/v3.11/manifests/calico.yaml.
Then I ran two pods with net-tools on the same node. They can reach each other, but it seems that pods on different nodes cannot reach each other.
On pod 1:
On pod 2:
Besides, because my master node has a node-role.kubernetes.io/master:NoSchedule taint, the adaptdl scheduler is running on a worker node. When I ran ps aux | grep python on a worker node, I found the adaptdl_sched.validator, allocator, and supervisor processes there:
root 25238 0.0 0.0 60192 52324 ? Ss Jun15 0:12 python -m adaptdl_sched.validator --host=0.0.0.0 --port=8443 --tls-crt=/mnt/tls.crt --tls-key=/mnt/tls.key
root 25437 0.0 0.0 133308 51832 ? Ssl Jun15 0:02 python -m adaptdl_sched
root 25622 0.2 0.0 5369344 125136 ? Ssl Jun15 1:07 python -m adaptdl_sched.allocator
root 25880 0.0 0.0 5360948 116412 ? Ssl Jun15 0:10 python -m adaptdl_sched.supervisor
I know this is normal, since the pods do not run on the master node, but could this be the reason why I cannot access the validator? When I try to run an adaptdl job, the same webhook error as above still occurs.
Thank you!
The scheduler pods running on worker nodes is expected behavior.
I suggest first making sure that inter-pod networking in your Kubernetes cluster is working correctly, e.g. by following https://projectcalico.docs.tigera.io/getting-started/kubernetes/hardway/test-networking. AdaptDL depends on functional networking between pods, and broken pod-to-pod networking can cause the kinds of failures you're experiencing.
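One more quick sanity check you could try (busybox:1.28 is suggested only because it still ships with nslookup): if in-cluster DNS fails here, that would also explain the apt-get "Temporary failure resolving" errors inside your pods.
# Test cluster DNS from a throwaway pod.
kubectl run dnstest --rm -it --image=busybox:1.28 --restart=Never -- nslookup kubernetes.default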
This problem was solved by uninstalling calico, deleting all calico-related files manually, and reinstalling it.