kubernetes.core
k8s_drain "Failed to delete pod" "Too many requests"
SUMMARY
I try to drain Kubernetes nodes, do some patching, and uncordon those nodes, but sometimes it fails with:
"msg": "Failed to delete pod POD NAME HERE due to: Too Many Requests"
It's usually a Longhorn pod.
This is an empty cluster (1 etcd/control-plane node, 3 worker nodes), used for testing only.
When I drain a node manually, it takes up to 2 minutes:
node/192.168.122.11 evicted
real 1m50.449s
user 0m1.128s
sys 0m0.645s
ISSUE TYPE
- Bug Report
COMPONENT NAME
k8s_drain
ANSIBLE VERSION
ansible [core 2.12.6]
COLLECTION VERSION
# /usr/lib/python3/dist-packages/ansible_collections
Collection      Version
--------------- -------
kubernetes.core 2.3.1
CONFIGURATION
DEFAULT_HOST_LIST(/etc/ansible/ansible.cfg) = ['/home/username/hosts']
DEPRECATION_WARNINGS(/etc/ansible/ansible.cfg) = False
OS / ENVIRONMENT
NAME="Ubuntu" VERSION="20.04.4 LTS (Focal Fossa)"
STEPS TO REPRODUCE
- name: "Drain node {{ inventory_hostname|lower }}, even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it."
kubernetes.core.k8s_drain:
state: drain
name: "{{ inventory_hostname|lower }}"
kubeconfig: ~/.kube/config
delete_options:
ignore_daemonsets: yes
delete_emptydir_data: yes
force: yes
terminate_grace_period: 5
wait_sleep: 20
delegate_to: localhost
EXPECTED RESULTS
TASK [Drain node 192.168.122.11, even if there are pods not managed by a ReplicationController, Job, or DaemonSet on it.] *****************************************************************************************
changed: [192.168.122.11 -> localhost]
ACTUAL RESULTS
The full traceback is:
  File "/tmp/ansible_kubernetes.core.k8s_drain_payload_favq191w/ansible_kubernetes.core.k8s_drain_payload.zip/ansible_collections/kubernetes/core/plugins/modules/k8s_drain.py", line 324, in evict_pods
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 7652, in create_namespaced_pod_eviction
    return self.create_namespaced_pod_eviction_with_http_info(name, namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 7759, in create_namespaced_pod_eviction_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 275, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
fatal: [192.168.122.11 -> localhost]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "api_key": null,
            "ca_cert": null,
            "client_cert": null,
            "client_key": null,
            "context": null,
            "delete_options": {
                "delete_emptydir_data": true,
                "disable_eviction": false,
                "force": true,
                "ignore_daemonsets": true,
                "terminate_grace_period": 5,
                "wait_sleep": 20,
                "wait_timeout": null
            },
            "host": null,
            "impersonate_groups": null,
            "impersonate_user": null,
            "kubeconfig": "/home/imre/.kube/config",
            "name": "192.168.122.11",
            "no_proxy": null,
            "password": null,
            "persist_config": null,
            "proxy": null,
            "proxy_headers": null,
            "state": "drain",
            "username": null,
            "validate_certs": null
        }
    },
    "msg": "Failed to delete pod longhorn-system/instance-manager-e-f5feaabb due to: Too Many Requests"
}
@impsik Thanks for filing the issue. The eviction API can sometimes return a 429 Too Many Requests status, especially if the eviction would violate the pod disruption budget. It's unclear if that's what is going on in your case or not. We should be retrying 429 responses here, and we aren't. As a workaround until this is fixed, you could try using Ansible's built-in retry logic.
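A minimal sketch of that retry workaround, reusing the task from the reproduction steps above (the registered variable name, retry count, and delay are illustrative choices, not part of the module):

- name: "Drain node {{ inventory_hostname|lower }}, retrying on transient eviction failures"
  kubernetes.core.k8s_drain:
    state: drain
    name: "{{ inventory_hostname|lower }}"
    kubeconfig: ~/.kube/config
    delete_options:
      ignore_daemonsets: yes
      delete_emptydir_data: yes
      force: yes
      terminate_grace_period: 5
      wait_sleep: 20
  delegate_to: localhost
  register: drain_result            # illustrative variable name
  until: drain_result is succeeded  # keep retrying while the eviction API returns 429
  retries: 10                       # illustrative; size this to how long the PDB needs to clear
  delay: 30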
@gravesm My setup was (I have since blown the cluster away): 3 etcd/control-plane nodes, 3 worker nodes, SSD disks.
Node config:
OS type and version: Ubuntu 20.04.4 LTS
CPU per node: 4
Memory per node: 8GB
I installed Longhorn with 3 replicas through Rancher apps.
I installed WordPress for testing (LINK), and this created 2 PVCs: one for SQL and one for WordPress.
From the Longhorn docs I found that I also need to use --pod-selector='app!=csi-attacher,app!=csi-provisioner'.
However, the pod-selector option is not supported by kubernetes.core.k8s_drain.
$ kubectl get poddisruptionbudgets -n longhorn-system
NAME                          MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
instance-manager-e-5369c190   1               N/A               0                     19m
instance-manager-e-662fd89d   1               N/A               0                     13m
instance-manager-e-b944b669   1               N/A               0                     16m
instance-manager-r-26f65c39   1               N/A               0                     16m
instance-manager-r-6030e0e8   1               N/A               0                     13m
instance-manager-r-6314b679   1               N/A               0                     19m
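The ALLOWED DISRUPTIONS value of 0 is exactly what makes the eviction API answer 429 for those pods. If you want to inspect this from Ansible before draining, a sketch along these lines should work (it uses kubernetes.core.k8s_info; the registered variable name is illustrative, and the api_version may need to be policy/v1beta1 on clusters older than 1.21):

- name: "List Longhorn PodDisruptionBudgets"
  kubernetes.core.k8s_info:
    api_version: policy/v1          # assumption: policy/v1beta1 on older clusters
    kind: PodDisruptionBudget
    namespace: longhorn-system
    kubeconfig: ~/.kube/config
  register: longhorn_pdbs           # illustrative variable name
  delegate_to: localhost

- name: "Show how many disruptions each PDB currently allows"
  ansible.builtin.debug:
    msg: "{{ item.metadata.name }}: {{ item.status.disruptionsAllowed }} allowed disruptions"
  loop: "{{ longhorn_pdbs.resources }}"
  delegate_to: localhost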
Another workaround for me was to use the shell module, which works fine:
shell: kubectl drain {{ inventory_hostname|lower }} --ignore-daemonsets --delete-emptydir-data --force --pod-selector='app!=csi-attacher,app!=csi-provisioner' --kubeconfig ~/.kube/config
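Written out as a complete task, that workaround could look roughly like this (a sketch; the folded-scalar layout and the delegation to localhost mirror the k8s_drain task above):

- name: "Drain node {{ inventory_hostname|lower }} with kubectl (pod-selector workaround)"
  ansible.builtin.shell: >
    kubectl drain {{ inventory_hostname|lower }}
    --ignore-daemonsets --delete-emptydir-data --force
    --pod-selector='app!=csi-attacher,app!=csi-provisioner'
    --kubeconfig ~/.kube/config
  delegate_to: localhost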
I can also confirm this behavior. It would be nice if the kubernetes.core module could support the pod-selector configuration. Using the shell module is just a "dirty workaround".
Same here: a few Pods on a Kubernetes control-plane node, no PDB on the Pods. Sometimes it does not evict any Pods at all.
The only reliable workaround for the moment is to revert to kubectl.