Custom resource watchers die but operator does not in environments with restricted Kubernetes API access
Long story short
When the operator's service account has limited access to the Kubernetes cluster (e.g., RBAC rules that grant access only to the current namespace), the watchers may die (e.g., due to a transient auth issue) and never recover. The operator then continues to run, but no longer monitors resources for changes. This appears to happen only for operators that handle custom resources.
Kopf version
1.37.2
Kubernetes version
1.29.5
Python version
3.10.14
Related Issues
- #980
Code
# monkeypatch the errors.check_response method so we can simulate when an auth error occurs via
# pkill -SIGUSR1 -nf kopf
import logging
import signal

import kopf
from kopf._cogs.clients import errors

logger = logging.getLogger(__name__)

old_check_response = errors.check_response
BROKEN_AUTH = False


def check_response(*args, **kwargs):
    logger.info("Running monkey patched checked response")
    if BROKEN_AUTH:
        logger.info("Auth is broken, raising error.")
        raise errors.APIUnauthorizedError(None, status=401)
    return old_check_response(*args, **kwargs)


errors.check_response = check_response


def break_auth(*_):
    global BROKEN_AUTH
    logger.info("Breaking auth")
    BROKEN_AUTH = True


signal.signal(signal.SIGUSR1, break_auth)


@kopf.on.update(
    CR_GROUP,
    CR_VERSION,
    CR_KIND,
)
@kopf.on.create(
    CR_GROUP,
    CR_VERSION,
    CR_KIND,
)
def monitor_custom_resource(
    name: str,
    namespace: str,
    status: kopf.Status,
    labels: kopf.Labels,
    **_,
): ...
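The CR_* placeholders above must be set to match your CRD; values consistent with the custom-resource.v1beta1.foo.com resource that appears in the logs below would be:

```python
# Hypothetical values matching the CRD referenced in the logs
# (custom-resource.v1beta1.foo.com); adjust them to your own CRD.
CR_GROUP = "foo.com"
CR_VERSION = "v1beta1"
CR_KIND = "custom-resource"
```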
Logs
[2024-12-10 15:43:22,060] kopf._core.engines.a [INFO ] Initial authentication has been initiated.
[2024-12-10 15:43:22,070] kopf.activities.auth [INFO ] Activity 'login_via_client' succeeded.
[2024-12-10 15:43:22,070] kopf._core.engines.a [INFO ] Initial authentication has finished.
[2024-12-10 15:43:22,080] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,081] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,083] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,084] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,087] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,088] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,091] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,091] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,091] kopf._core.reactor.o [WARNING ] Not enough permissions to list namespaces. Falling back to a list of namespaces which are assumed to exist: {'default'}
[2024-12-10 15:43:22,093] kopf._core.reactor.o [WARNING ] Not enough permissions to watch for resources: changes (creation/deletion/updates) will not be noticed; the resources are only refreshed on operator restarts.
[2024-12-10 15:43:22,094] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,094] kopf._core.reactor.o [WARNING ] Not enough permissions to watch for namespaces: changes (deletion/creation) will not be noticed; the namespaces are only refreshed on operator restarts.
[2024-12-10 15:43:22,115] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:22,126] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:43,996] __kopf_script_0__/Us [INFO ] Breaking auth
[2024-12-10 15:43:48,157] __kopf_script_0__/Us [INFO ] Running monkey patched checked response
[2024-12-10 15:43:48,157] __kopf_script_0__/Us [INFO ] Auth is broken, raising error.
[2024-12-10 15:43:48,158] kopf._core.engines.a [INFO ] Re-authentication has been initiated.
[2024-12-10 15:43:48,167] kopf.activities.auth [INFO ] Activity 'login_via_client' succeeded.
[2024-12-10 15:43:48,167] kopf._core.engines.a [INFO ] Re-authentication has finished.
[2024-12-10 15:43:48,167] kopf.objects [ERROR ] [default/custom-resource] Throttling for 1 seconds due to an unexpected error: LoginError('Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/')
Traceback (most recent call last):
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 50, in wrapper
response = await fn(*args, **kwargs, context=context)
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 85, in request
await errors.check_response(response) # but do not parse it!
File "/Users/jamesmchugh/git/operators/test_operator_bug.py", line 23, in check_response
raise errors.APIUnauthorizedError(None, status=401)
kopf._cogs.clients.errors.APIUnauthorizedError: (None, None)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/throttlers.py", line 44, in throttled
yield should_run
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/reactor/processing.py", line 130, in process_resource_event
applied = await application.apply(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/application.py", line 60, in apply
await patch_and_check(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/application.py", line 131, in patch_and_check
resulting_body = await patching.patch_obj(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/patching.py", line 47, in patch_obj
patched_body = await api.patch(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 155, in patch
response = await request(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 56, in wrapper
await vault.invalidate(key, exc=e)
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 297, in invalidate
raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
[2024-12-10 15:43:49,168] kopf.objects [INFO ] [default/custom-resource] Throttling is over. Switching back to normal operations.
[2024-12-10 15:43:49,169] kopf.objects [ERROR ] [default/custom-resource] Throttling for 1 seconds due to an unexpected error: LoginError('Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/')
Traceback (most recent call last):
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/throttlers.py", line 44, in throttled
yield should_run
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/reactor/processing.py", line 130, in process_resource_event
applied = await application.apply(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/application.py", line 60, in apply
await patch_and_check(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/actions/application.py", line 131, in patch_and_check
resulting_body = await patching.patch_obj(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/patching.py", line 47, in patch_obj
patched_body = await api.patch(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 155, in patch
response = await request(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 48, in wrapper
async for key, info, context in vault.extended(APIContext, 'contexts'):
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 158, in extended
async for key, item in self._items():
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 195, in _items
yielded_key, yielded_item = self.select()
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 214, in select
raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
[2024-12-10 15:43:50,165] kopf._core.reactor.q [WARNING ] Unprocessed streams left for [(custom-resource.v1beta1.foo.com, 'd782af8b-1cf4-42bc-abc3-c02ff635470f')].
[2024-12-10 15:43:50,166] kopf._core.reactor.o [ERROR ] Watcher for custom-resource.v1beta1.foo.com@default has failed: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
Traceback (most recent call last):
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/aiokits/aiotasks.py", line 96, in guard
await coro
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_core/reactor/queueing.py", line 175, in watcher
async for raw_event in stream:
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 86, in infinite_watch
async for raw_event in stream:
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 201, in continuous_watch
async for raw_input in stream:
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/watching.py", line 266, in watch_objs
async for raw_input in api.stream(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/api.py", line 200, in stream
response = await request(
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/clients/auth.py", line 48, in wrapper
async for key, info, context in vault.extended(APIContext, 'contexts'):
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 158, in extended
async for key, item in self._items():
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 195, in _items
yielded_key, yielded_item = self.select()
File "/Users/jamesmchugh/anaconda3/envs/python-3.10/lib/python3.10/site-packages/kopf/_cogs/structs/credentials.py", line 214, in select
raise LoginError("Ran out of valid credentials. Consider installing "
kopf._cogs.structs.credentials.LoginError: Ran out of valid credentials. Consider installing an API client library or adding a login handler. See more: https://kopf.readthedocs.io/en/stable/authentication/
# operator continues running, but doing nothing
Additional information
To reproduce this scenario, create a CRD and set the CR_* vars in the code above. Additionally, create a service account with roles that only have access to the resources in the namespace the operator is monitoring, such as below:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: operator-test
  namespace: default
automountServiceAccountToken: true
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: operator-test
  namespace: default
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["*"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: operator-test
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: operator-test
subjects:
  - kind: ServiceAccount
    name: operator-test
    namespace: "default"
Create a token for that service account (kubectl create token operator-test) and add it as a new user to your kubeconfig. Switch contexts so that the new context with this user is actively used.
Run the operator with
kopf run -n default <filename>
After startup, send the SIGUSR1 signal to the process to trigger the monkeypatched check method to raise an APIUnauthorizedError the next time it runs.
pkill -SIGUSR1 -nf kopf
Create or update the custom resource. Observe that the operator logs an error but continues running. Future create/update events of the custom resource (or any other resource if multiple handlers are used) are not observed.
In an environment where access is not restricted to a single namespace, the resource-observer and namespace-observer tasks run as core operator tasks. An error such as an auth failure therefore causes those tasks to fail and the operator to die. This is not the case when observing a single namespace.
Additionally, for built-in (non-custom) resources, the event-poster core task uses the events API to report when handlers succeed or fail. It too fails in the face of an auth issue, causing the event-poster task, and with it the operator, to die.
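The failure mode itself can be demonstrated with plain asyncio (a generic sketch, not Kopf's code): an exception in a background task is merely stored on the task object, so unless something awaits or inspects the task, the surrounding process keeps running with the watcher gone.

```python
import asyncio


async def watcher() -> None:
    # Stand-in for a watch stream that dies, e.g. with a LoginError.
    raise RuntimeError("credentials expired")


async def operator() -> str:
    task = asyncio.create_task(watcher())
    await asyncio.sleep(0.1)  # the watcher has already failed by now
    # The exception sits unretrieved on the task; nothing re-raises it,
    # so the "operator" keeps running with no watcher behind it.
    assert task.done() and task.exception() is not None
    return "still running"


print(asyncio.run(operator()))
```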
In the case of observing a custom resource within a single namespace, neither of the above safety nets is triggered. This results in the watchers dying silently while the operator keeps running. From reviewing the code, I think a fix could be to update https://github.com/nolar/kopf/blob/c158baee82486b731496c7eec01fe60fcea2efce/kopf/_core/reactor/orchestration.py#L104-L132 so that the orchestrator monitors the status of the tasks in the ensemble and raises an exception if any of them fail. An example is below:
async def orchestrator(
    *,
    processor: queueing.WatchStreamProcessor,
    settings: configuration.OperatorSettings,
    identity: peering.Identity,
    insights: references.Insights,
    operator_paused: aiotoggles.ToggleSet,
) -> None:
    peering_missing = await operator_paused.make_toggle(name='peering CRD is missing')
    ensemble = Ensemble(
        peering_missing=peering_missing,
        operator_paused=operator_paused,
        operator_indexed=aiotoggles.ToggleSet(all),
    )
    try:
        async with insights.revised:
            while True:
                wait_for_insights_task = aiotasks.create_guarded_task(
                    insights.revised.wait(), "wait-for-insights")
                done, pending = await aiotasks.wait(
                    [wait_for_insights_task, *ensemble.get_tasks(ensemble.get_keys())],
                    return_when=asyncio.FIRST_COMPLETED,
                )
                for task in done:
                    if task.exception() is not None:
                        raise task.exception()
                if wait_for_insights_task.done():
                    await adjust_tasks(
                        processor=processor,
                        insights=insights,
                        settings=settings,
                        identity=identity,
                        ensemble=ensemble,
                    )
    except asyncio.CancelledError:
        tasks = ensemble.get_tasks(ensemble.get_keys())
        await aiotasks.stop(tasks, title="streaming", logger=logger, interval=10)
        raise
With this change, watcher tasks whose exit status was previously unmonitored are now monitored, and exceptions in them cause the operator to exit.
I am not sure whether this approach has other side effects, though.
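For a self-contained illustration of this escalation pattern, here is the same idea reduced to plain asyncio (hypothetical worker names, not Kopf's internal API): a supervisor waits on both an event and its worker tasks, and re-raises the first worker failure instead of ignoring it.

```python
import asyncio


async def worker(name: str, fail: bool) -> None:
    await asyncio.sleep(0.05)
    if fail:
        raise RuntimeError(f"{name} died")  # a dying watcher
    await asyncio.sleep(10)  # a healthy watcher keeps streaming


async def supervise() -> None:
    revised = asyncio.Event()  # stand-in for insights.revised
    workers = [
        asyncio.create_task(worker("watcher-a", fail=False)),
        asyncio.create_task(worker("watcher-b", fail=True)),
    ]
    waiter = asyncio.create_task(revised.wait())
    try:
        done, _pending = await asyncio.wait(
            [waiter, *workers], return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is not None:
                raise task.exception()  # escalate instead of ignoring
    finally:
        for task in [waiter, *workers]:
            task.cancel()


try:
    asyncio.run(supervise())
except RuntimeError as exc:
    print("operator exiting:", exc)  # the failure now stops the operator
```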
For some additional context: the auth-related error I mentioned is the one trigger I found to reproduce this issue; however, it may not be the only one.
In theory, it can also be reproduced by dropping all of the signal handling and monkeypatching from the code above and instead simply deleting the service account (or removing its Role/RoleBinding) to trigger the bug.
@james-mchugh I agree, seems related to #1158 and #980. I have hit the scenario described here (watchers die, the operator continues) a few times while debugging those two issues. But it is different — here, something is wrong with the error escalation (or a lack of it) from the "orchestrator". It happened even with non-http errors — though not always, rather probabilistically, but nevertheless 50/50 reproducible.
First of all, thanks for the detailed investigation — the explanation seems totally valid and makes the chain of events clear. Second, the proposed solution also seems promising. Though I have some concerns about the guarded task & potential resource leaks (abandoned unawaited tasks/coros) — I need to think on this, and experiment with the repro a few times; and also double-check if #1031 maybe fixes this issue too — as a side effect.
Unfortunately, I have a somewhat limited capability to code now, so I code in small chunks & from time to time, and so I have to prioritize the issues. Once I get the feedback & merge the fix for those two issues above, I will take a look at this one (since it is adjacent code-wise). No timeline promises though, sorry.
No problem. Thank you for taking the time to look into this too! I'm happy to help with a contribution as well. I know the solution I presented may not be the best, so I'd be glad to look into contributing a fix if another solution comes to mind but you don't have the cycles to implement it yourself.