postgres-operator
AWS EKS 1.21: Bound Service Account Token Volume breaks postgres-operator and pods run into read-only mode
- Which image of the operator are you using? e.g. registry.opensource.zalan.do/acid/postgres-operator:v1.6.1
- Where do you run it - cloud or metal? Kubernetes or OpenShift? AWS K8s 1.21
- Are you running Postgres Operator in production? yes
- **Type of issue?** question
After upgrading to EKS 1.21 on AWS we faced the issue of stale service account tokens (https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens). The postgres-operator is set to use the postgres-pod service account. 90 days after the EKS cluster upgrade, pods that were 90 days old started logging this error in the postgres pods:
2022-05-25 00:31:10,670 ERROR: Unexpected error from Kubernetes API
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 481, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 1012, in touch_member
ret = self._api.patch_namespaced_pod(self._name, self._namespace, body)
File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 466, in wrapper
return getattr(self._core_v1_api, func)(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 402, in wrapper
return self._api_client.call_api(method, path, headers, body, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 371, in call_api
return self._handle_server_response(response, _preload_content)
File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 201, in _handle_server_response
raise k8s_client.rest.ApiException(http_resp=response)
patroni.dcs.kubernetes.K8sClient.rest.ApiException: (401)
Reason: Unauthorized
and this one:
2022-05-25 02:33:10,501 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:10,501 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:11,507 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:11,508 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:12.222 39 LOG {ticks: 0, maint: 0, retry: 0}
2022-05-25 02:33:12,513 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:12,514 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:13,524 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:13,525 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:14,532 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:14,532 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:15,547 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:15,547 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:16,564 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:16,565 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:17,572 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:17,572 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:18,380 ERROR: get_cluster
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 701, in _load_cluster
self._wait_caches(stop_time)
File "/usr/local/lib/python3.6/dist-packages/patroni/dcs/kubernetes.py", line 693, in _wait_caches
raise RetryFailedError('Exceeded retry deadline')
patroni.utils.RetryFailedError: 'Exceeded retry deadline'
2022-05-25 02:33:18,380 ERROR: Error communicating with DCS
2022-05-25 02:33:18,381 INFO: DCS is not accessible
2022-05-25 02:33:18,382 WARNING: Loop time exceeded, rescheduling immediately.
2022-05-25 02:33:18,580 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:18,581 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:19,591 ERROR: ObjectCache.run ApiException()
2022-05-25 02:33:19,591 ERROR: ObjectCache.run ApiException()
Is there any option to set the refresh time for tokens? We worked around it by deleting the pods one by one, but that is not an option in the long run.
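Until a proper fix is in place, it may help to know which pods are getting close to the limit so they can be restarted on your own schedule. Below is a rough sketch, not something from the operator or Patroni. It assumes the 90-day limit from the EKS documentation linked above, that the client inside the pod keeps the token it read at start, and that you have collected each pod's `.status.startTime` yourself (for example from `kubectl get pods -o json`). The function names, the 7-day margin, and the pod names in the usage note are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Limit taken from the EKS documentation linked in the issue (assumption).
MAX_TOKEN_AGE = timedelta(days=90)


def parse_start_time(value):
    """Parse a Kubernetes timestamp such as '2022-02-24T00:31:10Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def pods_at_risk(start_times, now, margin=timedelta(days=7)):
    """Return sorted names of pods whose age is within `margin` of the limit, or past it.

    `start_times` maps pod name -> .status.startTime string.
    """
    cutoff = now - (MAX_TOKEN_AGE - margin)
    return sorted(
        name
        for name, started in start_times.items()
        if parse_start_time(started) <= cutoff
    )
```

Usage would be something like `pods_at_risk({"acid-a-0": "2022-02-24T00:31:10Z"}, datetime.now(timezone.utc))`, then restarting the returned pods one at a time.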
Further investigation: we found this commit in zalando/patroni: https://github.com/zalando/patroni/commit/aa0cd480604069519ebd9b52b0d629e33287341c. It seems to refresh the token as needed, but it is only in master and not in any release yet, so the Spilo image does not include it either. I'll raise this in the Patroni issues too.
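For context, the general idea of such a fix is to stop caching the token forever and re-read it from the projected volume, where the kubelet keeps it fresh. The following is only a minimal sketch of that idea and not the code from the Patroni commit. The class name, the 60-second `ttl`, and the injectable `clock` are my own assumptions; the path is the usual mount point for the service account token inside a pod.

```python
import time

# Usual mount path of the service account token inside a pod.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


class ReloadingToken:
    """Cache the token, but re-read the file once the copy is `ttl` seconds old."""

    def __init__(self, path=TOKEN_PATH, ttl=60.0, clock=time.monotonic):
        self._path = path
        self._ttl = ttl
        self._clock = clock  # injectable so the behaviour can be tested
        self._value = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        if self._value is None or now - self._loaded_at >= self._ttl:
            # The kubelet rotates the file in place, so a plain re-read is enough.
            with open(self._path) as f:
                self._value = f.read().strip()
            self._loaded_at = now
        return self._value

    def auth_header(self):
        """Build the Authorization header from the current token."""
        return {"Authorization": "Bearer " + self.get()}
```

A client that calls `auth_header()` on every request, instead of building the header once at startup, would pick up rotated tokens without a pod restart.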
This raises the question of why Patroni isn't using the official Python client for Kubernetes, since that would have supported this automatically from version 12.0.0 onward (the latest version is 24.2.0), but I will reserve further thoughts / comments on that for threads in that repo.
That aside, it looks like this has now been released in Patroni 2.1.4: https://github.com/zalando/patroni/blob/master/docs/releases.rst#version-214
Spilo 2.1-p6 is the release that includes it: https://github.com/zalando/spilo/releases/tag/2.1-p6
So presumably either upgrading to https://github.com/zalando/postgres-operator/releases/tag/v1.8.2, where 2.1-p6 is the default image, or using .spec.dockerImage to override the image may work: https://github.com/zalando/postgres-operator/blob/3bfd63cbe624eb303d40f6e511e987f4343bb1d7/pkg/controller/operator_config.go#L42
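If you go the override route, a small helper like the one below could generate the patch body for a single cluster. This is an untested sketch: the image reference is my assumption of what the 2.1-p6 image is called, so please check the Spilo release page for the exact registry path and tag. The function name is made up.

```python
import json

# Assumed image reference, not verified against the registry.
SPILO_IMAGE = "registry.opensource.zalan.do/acid/spilo-14:2.1-p6"


def docker_image_patch(image):
    """Build a JSON merge patch that sets .spec.dockerImage on a postgresql resource."""
    return json.dumps({"spec": {"dockerImage": image}})


if __name__ == "__main__":
    print(docker_image_patch(SPILO_IMAGE))
```

The output could then be passed to `kubectl patch postgresql <cluster-name> --type merge -p '<patch>'`, where `<cluster-name>` is a placeholder for your own cluster resource.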
We will take the approach of upgrading the chart and confirm the latest Spilo / Patroni is automatically applied.
Hi, any update on the issue?
We just built a new spilo-patroni image and used it. I think this problem is already solved in newer versions of Patroni, so just update your version.