Problem when installing adaptdl scheduler

Open gudiandian opened this issue 2 years ago • 8 comments

Hi, I am trying to install the scheduler with Helm:

sudo helm install adaptdl adaptdl-sched --repo https://github.com/petuum/adaptdl/raw/helm-repo --namespace default --set docker-registry.enabled=true

However, the contents of the templates do not seem to be installed. I ran ps aux | grep python, but there is no "adaptdl_sched.allocator", "adaptdl_sched.supervisor", or "adaptdl_sched.validator" process. The scheduler's services seem to be okay:

$ kubectl describe services
Name:              adaptdl-adaptdl-sched
Namespace:         default
Labels:            app=adaptdl-sched
                   app.kubernetes.io/managed-by=Helm
                   release=adaptdl
Annotations:       meta.helm.sh/release-name: adaptdl
                   meta.helm.sh/release-namespace: default
Selector:          app=adaptdl-sched,release=adaptdl
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.104.151.75
IPs:               10.104.151.75
Port:              http  9091/TCP
TargetPort:        9091/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>


Name:                     adaptdl-registry
Namespace:                default
Labels:                   app=docker-registry
                          app.kubernetes.io/managed-by=Helm
                          chart=docker-registry-1.9.4
                          heritage=Helm
                          release=adaptdl
Annotations:              meta.helm.sh/release-name: adaptdl
                          meta.helm.sh/release-namespace: default
Selector:                 app=docker-registry,release=adaptdl
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.103.68.10
IPs:                      10.103.68.10
Port:                     registry  5000/TCP
TargetPort:               5000/TCP
NodePort:                 registry  32000/TCP
Endpoints:                <none>
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>


Name:              adaptdl-supervisor
Namespace:         default
Labels:            app=adaptdl-sched
                   app.kubernetes.io/managed-by=Helm
                   release=adaptdl
Annotations:       meta.helm.sh/release-name: adaptdl
                   meta.helm.sh/release-namespace: default
Selector:          app=adaptdl-sched,release=adaptdl
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.111.1.168
IPs:               10.111.1.168
Port:              http  8080/TCP
TargetPort:        8080/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>


Name:              adaptdl-validator
Namespace:         default
Labels:            app=adaptdl-validator
                   app.kubernetes.io/managed-by=Helm
                   release=adaptdl
Annotations:       meta.helm.sh/release-name: adaptdl
                   meta.helm.sh/release-namespace: default
Selector:          app=adaptdl-validator,release=adaptdl
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.102.34.180
IPs:               10.102.34.180
Port:              https  443/TCP
TargetPort:        https/TCP
Endpoints:         <none>
Session Affinity:  None
Events:            <none>


Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.96.0.1
IPs:               10.96.0.1
Port:              https  443/TCP
TargetPort:        6443/TCP
Endpoints:         10.5.0.4:6443
Session Affinity:  None
Events:            <none>

When I try to run an AdaptDL job, this error always occurs:

  File "run_workload.py", line 136, in <module>
    objs_api.create_namespaced_custom_object(*obj_args, job)
  File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 225, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace, plural, body, **kwargs)  # noqa: E501
  File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 344, in create_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/rest.py", line 274, in POST
    return self.request("POST", url,
  File "/home/ubuntu/software/miniconda3/envs/pollux/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '01c68c3b-393c-419a-9b81-d3393b80d47f', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'abb23a1c-5ca0-4b97-a67f-58f65e44bf9d', 'Date': 'Wed, 15 Jun 2022 14:49:52 GMT', 'Content-Length': '521'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"adaptdl-validator.default.svc.cluster.local\": Post \"https://adaptdl-validator.default.svc:443/validate?timeout=10s\": context deadline exceeded","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"adaptdl-validator.default.svc.cluster.local\": Post \"https://adaptdl-validator.default.svc:443/validate?timeout=10s\": context deadline exceeded"}]},"code":500}

I followed these commands to set up the environment:

CNI_VERSION="v0.8.2"
ARCH="amd64"
sudo mkdir -p /opt/cni/bin
curl -L "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-${ARCH}-${CNI_VERSION}.tgz" | sudo tar -C /opt/cni/bin -xz
DOWNLOAD_DIR=/usr/local/bin
sudo mkdir -p $DOWNLOAD_DIR
CRICTL_VERSION="v1.22.0"

ARCH="amd64"
curl -L "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-${ARCH}.tar.gz" | sudo tar -C $DOWNLOAD_DIR -xz
RELEASE="v1.21.0"
ARCH="amd64"
cd $DOWNLOAD_DIR
sudo curl -L --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/${ARCH}/{kubeadm,kubelet,kubectl}
sudo chmod +x {kubeadm,kubelet,kubectl}
RELEASE_VERSION="v0.4.0"
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubelet/lib/systemd/system/kubelet.service" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service
sudo mkdir -p /etc/systemd/system/kubelet.service.d
curl -sSL "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubeadm/10-kubeadm.conf" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | sudo tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
systemctl enable --now kubelet

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh -b -p ${HOME}/software/miniconda3
echo "export PATH=${HOME}/software/miniconda3/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc


sudo apt install conntrack

sudo snap install yq --channel=v3/stable
sudo kubeadm init --pod-network-cidr=192.168.0.0/16
mkdir -p ~/.kube
sudo cp /etc/kubernetes/admin.conf ~/.kube/config
sudo chown -f -R $USER ~/.kube
kubectl apply -f https://docs.projectcalico.org/v3.11/manifests/calico.yaml
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta5/nvidia-device-plugin.yml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/common.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/operator.yaml
curl -s https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/cluster.yaml | /snap/bin/yq w - spec.storage.deviceFilter nvme0n1p2 | kubectl apply -f -
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/filesystem.yaml
kubectl apply -f https://raw.githubusercontent.com/rook/rook/v1.3.1/cluster/examples/kubernetes/ceph/csi/cephfs/storageclass.yaml
docker login -u ${var.docker_username} -p '${var.docker_password}'
kubectl create secret generic regcred --from-file=.dockerconfigjson=/home/ubuntu/.docker/config.json --type=kubernetes.io/dockerconfigjson
helm repo add stable https://charts.helm.sh/stable --force-update
conda env update -f ~/adaptdl/benchmark/environment.yaml # path

#install helm (https://helm.sh/docs/intro/install/)

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

# install scheduler
sudo helm install adaptdl adaptdl-sched --repo https://github.com/petuum/adaptdl/raw/helm-repo --namespace default --set docker-registry.enabled=true

What is really strange is that everything seems to be fine:

NAME                                        READY   STATUS    RESTARTS   AGE
pod/adaptdl-adaptdl-sched-cbc794b8f-8xq2f   3/3     Running   0          36m
pod/adaptdl-registry-76d9c8b759-tqrdv       1/1     Running   0          36m
pod/adaptdl-validator-d878bc9c9-ddglc       1/1     Running   0          36m
pod/images-6jsfj                            6/6     Running   0          103m
pod/images-gldv7                            6/6     Running   0          103m
pod/images-lprhh                            6/6     Running   0          103m
pod/images-qsglk                            6/6     Running   0          103m

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
service/adaptdl-adaptdl-sched   ClusterIP   10.102.52.25     <none>        9091/TCP         36m
service/adaptdl-registry        NodePort    10.100.70.211    <none>        5000:32000/TCP   36m
service/adaptdl-supervisor      ClusterIP   10.111.108.197   <none>        8080/TCP         36m
service/adaptdl-validator       ClusterIP   10.98.19.54      <none>        443/TCP          36m
service/kubernetes              ClusterIP   10.96.0.1        <none>        443/TCP          7h6m

NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/images   4         4         4       4            4           <none>          103m

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/adaptdl-adaptdl-sched   1/1     1            1           36m
deployment.apps/adaptdl-registry        1/1     1            1           36m
deployment.apps/adaptdl-validator       1/1     1            1           36m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/adaptdl-adaptdl-sched-cbc794b8f   1         1         1       36m
replicaset.apps/adaptdl-registry-76d9c8b759       1         1         1       36m
replicaset.apps/adaptdl-validator-d878bc9c9       1         1         1       36m

Do you know why this would happen? Thank you!

gudiandian avatar Jun 15 '22 15:06 gudiandian

Could you provide the output of helm list and kubectl get all?

aurickq avatar Jun 15 '22 15:06 aurickq

Sure

 $ helm list
NAME   	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART               	APP VERSION
adaptdl	default  	1       	2022-06-15 14:43:38.613906983 +0000 UTC	deployed	adaptdl-sched-0.2.11	0.2.11
$ kubectl get all
NAME                                        READY   STATUS    RESTARTS   AGE
pod/adaptdl-adaptdl-sched-cbc794b8f-8xq2f   3/3     Running   0          45m
pod/adaptdl-registry-76d9c8b759-tqrdv       1/1     Running   0          45m
pod/adaptdl-validator-d878bc9c9-ddglc       1/1     Running   0          45m
pod/images-6jsfj                            6/6     Running   0          112m
pod/images-gldv7                            6/6     Running   0          112m
pod/images-lprhh                            6/6     Running   0          112m
pod/images-qsglk                            6/6     Running   0          112m

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
service/adaptdl-adaptdl-sched   ClusterIP   10.102.52.25     <none>        9091/TCP         45m
service/adaptdl-registry        NodePort    10.100.70.211    <none>        5000:32000/TCP   45m
service/adaptdl-supervisor      ClusterIP   10.111.108.197   <none>        8080/TCP         45m
service/adaptdl-validator       ClusterIP   10.98.19.54      <none>        443/TCP          45m
service/kubernetes              ClusterIP   10.96.0.1        <none>        443/TCP          7h15m

NAME                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/images   4         4         4       4            4           <none>          112m

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/adaptdl-adaptdl-sched   1/1     1            1           45m
deployment.apps/adaptdl-registry        1/1     1            1           45m
deployment.apps/adaptdl-validator       1/1     1            1           45m

NAME                                              DESIRED   CURRENT   READY   AGE
replicaset.apps/adaptdl-adaptdl-sched-cbc794b8f   1         1         1       45m
replicaset.apps/adaptdl-registry-76d9c8b759       1         1         1       45m
replicaset.apps/adaptdl-validator-d878bc9c9       1         1         1       45m

Thank you for your quick reply!

gudiandian avatar Jun 15 '22 15:06 gudiandian

Besides, I found that if I set the repository to the default localhost:32000/pollux, I get this error:

tGet "http://localhost:32000/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Currently I am using Docker Hub as a workaround. I don't know whether this error is related to the one above, since they both seem to be related to the installation of the scheduler. I would really appreciate any help. Thanks!
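
For reference, one way to check whether the NodePort registry is reachable (a rough sketch; these are run directly on a cluster node, and port 32000 is just the NodePort shown above):

# the Docker Registry v2 API root should answer with HTTP 200 if the registry is reachable
curl -v http://localhost:32000/v2/
# the Service should list the registry pod as an endpoint, not <none>
kubectl get endpoints adaptdl-registry -n default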

gudiandian avatar Jun 15 '22 16:06 gudiandian

It seems a bit like certain pairs (or all pairs?) of Kubernetes pods are not able to reach each other. Could you try:

  • kubectl exec into a pod and try to ping a different pod.
  • Check the health of the calico plugin (in the kube-system namespace).
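
For example, roughly (a sketch; the pod names and IPs are placeholders to substitute from the kubectl get pods -o wide output, and the grep-based check just assumes the standard Calico manifest):

# note each pod's IP and the node it runs on
kubectl get pods -o wide
# from inside one pod, try to reach another pod's IP (use wget/curl if ping is unavailable in the image)
kubectl exec -it <some-pod> -- ping -c 3 <other-pod-ip>
# check that the calico pods are all Running and look at their logs for errors
kubectl get pods -n kube-system | grep calico
kubectl logs -n kube-system <calico-node-pod> --tail=50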

aurickq avatar Jun 15 '22 16:06 aurickq

Thank you for your suggestions ❤️ As for the first suggestion, unfortunately, the only pods I have are the scheduler pods and the image pods, and neither can run ping or install the ping command. I am sure that the servers can reach each other. Do you have any other way to test whether the pods can reach each other?

As for the second suggestion, I checked the pods, deployments, and daemonsets in the kube-system namespace and I see no errors.

$ kubectl get all -n kube-system
NAME                                                READY   STATUS    RESTARTS   AGE
pod/calico-kube-controllers-5bcd7db644-6z8f6        1/1     Running   0          11m
pod/calico-node-hp4p6                               1/1     Running   0          6m7s
pod/calico-node-j5r62                               1/1     Running   0          6m1s
pod/calico-node-tqx9p                               1/1     Running   0          6m3s
pod/calico-node-vp84l                               1/1     Running   0          11m
pod/coredns-558bd4d5db-f9r56                        1/1     Running   0          12m
pod/coredns-558bd4d5db-wr5wm                        1/1     Running   0          12m
pod/etcd-superbench-dev-00000g                      1/1     Running   0          12m
pod/kube-apiserver-superbench-dev-00000g            1/1     Running   0          12m
pod/kube-controller-manager-superbench-dev-00000g   1/1     Running   0          12m
pod/kube-proxy-cj9gh                                1/1     Running   0          12m
pod/kube-proxy-l4dxb                                1/1     Running   0          6m1s
pod/kube-proxy-pmg2j                                1/1     Running   0          6m3s
pod/kube-proxy-sc7cl                                1/1     Running   0          6m7s
pod/kube-scheduler-superbench-dev-00000g            1/1     Running   0          12m
pod/nvidia-device-plugin-daemonset-4fhbv            1/1     Running   0          5m53s
pod/nvidia-device-plugin-daemonset-s2vgb            1/1     Running   0          5m51s
pod/nvidia-device-plugin-daemonset-shq25            1/1     Running   0          5m57s

NAME               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
service/kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   12m

NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
daemonset.apps/calico-node                      4         4         4       4            4           beta.kubernetes.io/os=linux   11m
daemonset.apps/kube-proxy                       4         4         4       4            4           kubernetes.io/os=linux        12m
daemonset.apps/nvidia-device-plugin-daemonset   3         3         3       3            3           <none>                        11m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/calico-kube-controllers   1/1     1            1           11m
deployment.apps/coredns                   2/2     2            2           12m

NAME                                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/calico-kube-controllers-5bcd7db644   1         1         1       11m
replicaset.apps/coredns-558bd4d5db                   2         2         2       12m

Thanks again for your help!

gudiandian avatar Jun 15 '22 17:06 gudiandian

You'll want to create a pod running a container that can run ping.
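
For example (a minimal sketch; busybox ships with ping, and the pod name and target IP here are placeholders):

# start a throwaway pod that can run ping
kubectl run pingtest --image=busybox:1.36 --restart=Never -- sleep 3600
# find the IP of a pod on a different node
kubectl get pods -o wide
# ping it from inside the test pod
kubectl exec -it pingtest -- ping -c 3 <other-pod-ip>
# clean up afterwards
kubectl delete pod pingtest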

aurickq avatar Jun 15 '22 23:06 aurickq

You'll want to create a pod running a container that can run ping.

Thank you for your suggestion! It does seem that there is some networking problem. The reason I cannot run ping is that I can never run apt-get to install the ping command in any pod:

root@hello-world:~/# apt-get install inetutils-ping
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package inetutils-ping
root@hello-world:~/# ifconfig
bash: ifconfig: command not found
root@hello-world:~/# apt-get install net-tools
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package net-tools
root@hello-world:~/# apt-get update
Err:1 http://archive.ubuntu.com/ubuntu focal InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://security.ubuntu.com/ubuntu focal-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu focal-updates InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu focal-backports InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Reading package lists... Done
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-backports/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/focal-security/InRelease  Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.

I tried building a container directly with Docker, and I can successfully run apt-get inside it, so there must be something wrong with Kubernetes networking. However, I found no errors in the coredns or calico-related pods. I am quite sure that I ran sudo kubeadm init --pod-network-cidr=192.168.0.0/16 and kubectl apply -f https://docs.projectcalico.org/v3.11/manifests/calico.yaml.
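
The "Temporary failure resolving" errors point at in-cluster DNS. A quick way to check it (a sketch, assuming a busybox-based test pod; names are placeholders):

# busybox includes nslookup
kubectl run dnstest --image=busybox:1.36 --restart=Never -- sleep 3600
# should resolve via the cluster DNS service (kube-dns at 10.96.0.10 above)
kubectl exec -it dnstest -- nslookup kubernetes.default
# an external name; fails if pod-to-CoreDNS or CoreDNS-to-upstream traffic is broken
kubectl exec -it dnstest -- nslookup archive.ubuntu.com
kubectl delete pod dnstest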

Then I ran two pods with net-tools on the same node; they can reach each other. However, it seems that pods on different nodes cannot reach each other. [Screenshots from pod 1 and pod 2]

Besides, I found that because my master node has a node-role.kubernetes.io/master:NoSchedule taint, the adaptdl scheduler seems to be running on a worker node: when I ran ps aux | grep python on a worker node, I found the adaptdl_sched.validator, allocator, and supervisor processes there:

root     25238  0.0  0.0  60192 52324 ?        Ss   Jun15   0:12 python -m adaptdl_sched.validator --host=0.0.0.0 --port=8443 --tls-crt=/mnt/tls.crt --tls-key=/mnt/tls.key
root     25437  0.0  0.0 133308 51832 ?        Ssl  Jun15   0:02 python -m adaptdl_sched
root     25622  0.2  0.0 5369344 125136 ?      Ssl  Jun15   1:07 python -m adaptdl_sched.allocator
root     25880  0.0  0.0 5360948 116412 ?      Ssl  Jun15   0:10 python -m adaptdl_sched.supervisor

I know this is normal because the pods do not run on the master node, but could this be the reason why I cannot access the validator?

When I try to run an AdaptDL job, the error above still occurs.

Thank you!

gudiandian avatar Jun 16 '22 02:06 gudiandian

The scheduler pods running on worker nodes is expected behavior.

I suggest first making sure that inter-pod networking in your Kubernetes cluster is working correctly, e.g. by following https://projectcalico.docs.tigera.io/getting-started/kubernetes/hardway/test-networking. AdaptDL depends on functional networking between pods, and broken networking can cause some of the failures you're experiencing.
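
Since the original error was the API server timing out while calling the adaptdl-validator webhook, it is also worth confirming that the validator Service has endpoints behind it and that the webhook is registered (a minimal check, assuming the default namespace used above; your earlier kubectl describe services output showed Endpoints: <none>):

# should list the validator pod's IP, not <none>
kubectl get endpoints adaptdl-validator -n default
# the adaptdl validating webhook should be registered here
kubectl get validatingwebhookconfigurations
# the pod should be Ready and its labels should match the Service selector (app=adaptdl-validator,release=adaptdl)
kubectl describe pod -l app=adaptdl-validator -n default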

aurickq avatar Jun 16 '22 03:06 aurickq

This problem was solved by uninstalling Calico, manually deleting all Calico-related files, and reinstalling it.

gudiandian avatar Nov 27 '22 06:11 gudiandian