
k8s_drain "Failed to delete pod" "Too many requests"

impsik opened this issue 2 years ago

SUMMARY

I'm trying to drain Kubernetes nodes, apply some patches, and uncordon the nodes, but the drain sometimes fails with "msg": "Failed to delete pod POD NAME HERE due to: Too Many Requests". It's usually a Longhorn pod. This is an empty cluster (1 etcd/CP node, 3 worker nodes) used only for testing. When I drain the node manually, it takes up to 2 minutes:

node/192.168.122.11 evicted

real	1m50.449s
user	0m1.128s
sys	0m0.645s
ISSUE TYPE
  • Bug Report
COMPONENT NAME

k8s_drain

ANSIBLE VERSION
ansible [core 2.12.6]

COLLECTION VERSION

# /usr/lib/python3/dist-packages/ansible_collections
Collection      Version
--------------- -------
kubernetes.core 2.3.1

CONFIGURATION
DEFAULT_HOST_LIST(/etc/ansible/ansible.cfg) = ['/home/username/hosts']
DEPRECATION_WARNINGS(/etc/ansible/ansible.cfg) = False

OS / ENVIRONMENT

NAME="Ubuntu" VERSION="20.04.4 LTS (Focal Fossa)"

STEPS TO REPRODUCE
- name: "Drain node {{ inventory_hostname|lower }}, even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it."
  kubernetes.core.k8s_drain:
    state: drain
    name: "{{ inventory_hostname|lower }}"
    kubeconfig: ~/.kube/config
    delete_options:
      ignore_daemonsets: yes
      delete_emptydir_data: yes
      force: yes
      terminate_grace_period: 5
      wait_sleep: 20
  delegate_to: localhost

EXPECTED RESULTS

TASK [Drain node 192.168.122.11, even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it.] *****************************************************************************************
changed: [192.168.122.11 -> localhost]

ACTUAL RESULTS

The full traceback is:
  File "/tmp/ansible_kubernetes.core.k8s_drain_payload_favq191w/ansible_kubernetes.core.k8s_drain_payload.zip/ansible_collections/kubernetes/core/plugins/modules/k8s_drain.py", line 324, in evict_pods
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 7652, in create_namespaced_pod_eviction
    return self.create_namespaced_pod_eviction_with_http_info(name, namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 7759, in create_namespaced_pod_eviction_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 275, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)

fatal: [192.168.122.11 -> localhost]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "api_key": null,
            "ca_cert": null,
            "client_cert": null,
            "client_key": null,
            "context": null,
            "delete_options": {
                "delete_emptydir_data": true,
                "disable_eviction": false,
                "force": true,
                "ignore_daemonsets": true,
                "terminate_grace_period": 5,
                "wait_sleep": 20,
                "wait_timeout": null
            },
            "host": null,
            "impersonate_groups": null,
            "impersonate_user": null,
            "kubeconfig": "/home/imre/.kube/config",
            "name": "192.168.122.11",
            "no_proxy": null,
            "password": null,
            "persist_config": null,
            "proxy": null,
            "proxy_headers": null,
            "state": "drain",
            "username": null,
            "validate_certs": null
        }
    },
    "msg": "Failed to delete pod longhorn-system/instance-manager-e-f5feaabb due to: Too Many Requests"
}


impsik avatar Jun 08 '22 17:06 impsik

@impsik Thanks for filing the issue. The eviction API can sometimes return a 429 Too Many Requests status, especially if the eviction would violate the pod disruption budget. It's unclear if that's what is going on in your case or not. We should be retrying 429 responses here, and we aren't. As a workaround until this is fixed, you could try using Ansible's built-in retry logic.
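As a sketch of that workaround (untested against this exact failure; the task name, retry count, and delay are illustrative, the rest mirrors the task from the report):

```yaml
- name: "Drain node {{ inventory_hostname|lower }}, retrying transient 429 responses"
  kubernetes.core.k8s_drain:
    state: drain
    name: "{{ inventory_hostname|lower }}"
    kubeconfig: ~/.kube/config
    delete_options:
      ignore_daemonsets: yes
      delete_emptydir_data: yes
      force: yes
      terminate_grace_period: 5
  delegate_to: localhost
  register: drain_result
  # Re-run the whole drain until it succeeds; pods already evicted
  # stay evicted, so each retry only has to finish the stragglers.
  until: drain_result is succeeded
  retries: 10
  delay: 20
```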

gravesm avatar Jun 27 '22 15:06 gravesm

@gravesm My setup was (I have since torn the cluster down): 3 etcd/CP nodes, 3 worker nodes, SSD disks.

Node config:
OS type and version: Ubuntu 20.04.4 LTS
CPU per node: 4
Memory per node: 8GB

I installed Longhorn with 3 replicas through Rancher apps. I installed WordPress for testing LINK, which created 2 PVCs, one for SQL and one for WordPress. From the Longhorn docs I found that I also need to use --pod-selector='app!=csi-attacher,app!=csi-provisioner'. However, the pod-selector option is not supported by kubernetes.core.k8s_drain.

$ kubectl get poddisruptionbudgets -n longhorn-system
NAME                          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
instance-manager-e-5369c190   1               N/A               0                     19m
instance-manager-e-662fd89d   1               N/A               0                     13m
instance-manager-e-b944b669   1               N/A               0                     16m
instance-manager-r-26f65c39   1               N/A               0                     16m
instance-manager-r-6030e0e8   1               N/A               0                     13m
instance-manager-r-6314b679   1               N/A               0                     19m

Another workaround for me was to use the shell module, which works fine:

shell: kubectl drain {{ inventory_hostname|lower }} --ignore-daemonsets --delete-emptydir-data --force --pod-selector='app!=csi-attacher,app!=csi-provisioner' --kubeconfig ~/.kube/config
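Spelled out as a full task, that workaround might look like this (a sketch; the task name and delegate_to are assumptions carried over from the earlier playbook):

```yaml
- name: "Drain node {{ inventory_hostname|lower }} via kubectl (workaround for missing pod-selector support)"
  ansible.builtin.shell: >
    kubectl drain {{ inventory_hostname|lower }}
    --ignore-daemonsets --delete-emptydir-data --force
    --pod-selector='app!=csi-attacher,app!=csi-provisioner'
    --kubeconfig ~/.kube/config
  delegate_to: localhost
```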

impsik avatar Jun 28 '22 08:06 impsik

I can also confirm this behavior. It would be nice if the kubernetes.core module could support the pod-selector configuration; using the shell module is just a "dirty workaround".

0Styless avatar Nov 18 '22 07:11 0Styless

Same here: a few Pods on a Kubernetes control-plane node, no PDB on the Pods. Sometimes it does not evict any Pods.

The only reliable workaround for the moment is to revert to kubectl.

stephan2012 avatar Feb 03 '23 18:02 stephan2012