When embedding, setting an invalid KUBECONFIG breaks kopf start-up and termination
Long story short
Disclaimer: I'm new to kopf, so I am not 100% confident I'm doing everything correctly here and am not triggering this by accident. Here's hoping I'm not wasting anybody's time :)
When running kopf embedded, an invalid KUBECONFIG env setting breaks start-up here:
https://github.com/nolar/kopf/blob/1c2488330ba35492bd50d24b324550d947194ecd/kopf/reactor/running.py#L248-L251
Ultimately this leads to a non-exiting kopf thread after signalling stop_flag from the main thread. The process has to be killed manually; otherwise it hangs "forever".
Note
In master, peering detection seems to have received some TLC which might fix this (I haven't tried), but that revision is not released: https://github.com/nolar/kopf/blob/9fe89e94a70ebd2a242b74aec6d6f3403da3bd88/kopf/reactor/running.py#L247-L266
Description
While developing locally, I usually manage the KUBECONFIG env variable to point to the correct kube config. While working on our application's error-handling code, I deliberately broke it by setting KUBECONFIG=missing, which exposed the behavior.
This wrapper seems to be responsible for throwing the LoginError which bubbles up to running.py: https://github.com/nolar/kopf/blob/1c2488330ba35492bd50d24b324550d947194ecd/kopf/clients/auth.py#L34-L52
The following should reproduce it OOTB:
import logging
import asyncio
import os
import threading
import contextlib
import time

import kopf

logger = logging.getLogger(__name__)


class KopfRunner(object):

    def __init__(self) -> None:
        self.readyEvent = threading.Event()
        self.stopFlag = threading.Event()
        self.kopfThread = threading.Thread(name="kopf-main", target=self.__setup_kopf_event_loop, kwargs=dict(
            stop_flag=self.stopFlag,
            ready_flag=self.readyEvent,
        ))

    @kopf.on.login(retries=1, backoff=3.0)
    def login_fn(**kwargs):
        return kopf.login_via_pykube(**kwargs)

    @staticmethod
    def __setup_kopf_event_loop(ready_flag: threading.Event, stop_flag: threading.Event):
        kopf_loop = asyncio.new_event_loop()
        asyncio.set_event_loop(kopf_loop)
        with contextlib.closing(kopf_loop):
            kopf.configure(verbose=True, debug=True)
            kopf_loop.run_until_complete(kopf.operator(
                ready_flag=ready_flag,
                stop_flag=stop_flag,
            ))

    def start(self):
        logger.info("Starting kopf...")
        self.kopfThread.start()
        self.readyEvent.wait()
        logger.info("Kopf ready.")

    def stop(self):
        logger.info("Signaling kopf termination...")
        self.stopFlag.set()
        self.kopfThread.join()
        logger.info("kopf terminated...")


if __name__ == "__main__":
    # making this point to a valid kube config instead of "missing" will properly terminate this script after 10s.
    # setting it to any value not pointing to a valid config will prevent kopf thread termination.
    os.environ["KUBECONFIG"] = "missing"
    runner = KopfRunner()
    runner.start()
    time.sleep(10)
    runner.stop()
The exact command to reproduce the issue
# put the above sample code in a file named bug.py, then run
python3 bug.py
The full output of the command that failed
[2020-12-07 18:01:31,586] __main__ [INFO ] Kopf ready.
[2020-12-07 18:01:31,588] kopf.reactor.activit [INFO ] Initial authentication has been initiated.
[2020-12-07 18:01:31,588] kopf.activities.auth [DEBUG ] Activity 'KopfRunner.login_fn' is invoked.
[2020-12-07 18:01:31,590] kopf.activities.auth [ERROR ] Activity 'KopfRunner.login_fn' failed with an exception. Will retry.
Traceback (most recent call last):
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/utilities/piggybacking.py", line 95, in login_via_pykube
config = auth.get_pykube_cfg()
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/clients/auth.py", line 263, in get_pykube_cfg
raise NotImplementedError("get_pykube_cfg() is not supported unless monkey-patched.")
NotImplementedError: get_pykube_cfg() is not supported unless monkey-patched.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/utilities/piggybacking.py", line 99, in login_via_pykube
config = pykube.KubeConfig.from_service_account()
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/pykube/config.py", line 41, in from_service_account
with service_account_dir.joinpath("token").open() as fp:
File "/usr/lib/python3.9/pathlib.py", line 1241, in open
return io.open(self, mode, buffering, encoding, errors, newline,
File "/usr/lib/python3.9/pathlib.py", line 1109, in _opener
return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/var/run/secrets/kubernetes.io/serviceaccount/token'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/utilities/piggybacking.py", line 103, in login_via_pykube
config = pykube.KubeConfig.from_file()
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/pykube/config.py", line 92, in from_file
raise exceptions.PyKubeError(
pykube.exceptions.PyKubeError: Configuration file missing not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/reactor/handling.py", line 264, in execute_handler_once
result = await invoke_handler(
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/reactor/handling.py", line 366, in invoke_handler
result = await invocation.invoke(
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/reactor/invocation.py", line 147, in invoke
await asyncio.shield(future) # slightly expensive: creates tasks
File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/cs/PycharmProjects/[redacted]/packages/ts_runner_operator/test/bug.py", line 26, in login_fn
return kopf.login_via_pykube(**kwargs)
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/utilities/piggybacking.py", line 106, in login_via_pykube
raise credentials.LoginError(f"Cannot authenticate pykube "
kopf.structs.credentials.LoginError: Cannot authenticate pykube neither in-cluster, nor via kubeconfig.
[2020-12-07 18:01:34,595] kopf.activities.auth [DEBUG ] Activity 'KopfRunner.login_fn' is invoked.
[2020-12-07 18:01:34,595] kopf.activities.auth [ERROR ] Activity 'KopfRunner.login_fn' failed permanently: Activity 'KopfRunner.login_fn' has exceeded 1 retries.
[2020-12-07 18:01:34,596] kopf.reactor.running [ERROR ] Root task 'credentials retriever' is failed: One or more handlers failed.
Traceback (most recent call last):
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/reactor/handling.py", line 262, in execute_handler_once
raise HandlerRetriesError(f"{handler} has exceeded {state.retries} retries.")
kopf.reactor.handling.HandlerRetriesError: Activity 'KopfRunner.login_fn' has exceeded 1 retries.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/reactor/running.py", line 441, in _root_task_checker
await coro
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/reactor/activities.py", line 51, in authenticator
await authenticate(
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/reactor/activities.py", line 75, in authenticate
activity_results = await run_activity(
File "/home/cs/PycharmProjects/[redacted]/venv/lib/python3.9/site-packages/kopf/reactor/activities.py", line 125, in run_activity
raise ActivityError("One or more handlers failed.", outcomes=outcomes) from exceptions[0]
kopf.reactor.activities.ActivityError: One or more handlers failed.
[2020-12-07 18:01:41,597] __main__ [INFO ] Signaling kopf termination...
[2020-12-07 18:01:41,597] kopf.reactor.running [INFO ] Stop-flag is set to True. Operator is stopping.
The full output of the command that works as intended, when `KUBECONFIG` is valid
[2020-12-07 18:02:15,331] __main__ [INFO ] Kopf ready.
[2020-12-07 18:02:15,333] kopf.reactor.activit [INFO ] Initial authentication has been initiated.
[2020-12-07 18:02:15,333] kopf.activities.auth [DEBUG ] Activity 'KopfRunner.login_fn' is invoked.
[2020-12-07 18:02:15,342] kopf.activities.auth [DEBUG ] Pykube is configured via kubeconfig file.
[2020-12-07 18:02:15,344] kopf.activities.auth [INFO ] Activity 'KopfRunner.login_fn' succeeded.
[2020-12-07 18:02:15,344] kopf.reactor.activit [INFO ] Initial authentication has finished.
[2020-12-07 18:02:15,388] asyncio [DEBUG ] <asyncio.sslproto.SSLProtocol object at 0x7f8e7efe2bb0> starts SSL handshake
[2020-12-07 18:02:15,432] asyncio [DEBUG ] <asyncio.sslproto.SSLProtocol object at 0x7f8e7efe2bb0>: SSL handshake took 44.0 ms
[2020-12-07 18:02:15,433] asyncio [DEBUG ] <asyncio.TransportSocket fd=6, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('172.18.66.162', 42364), raddr=('172.19.1.182', 6443)> connected to 172.19.1.182:6443: (<asyncio.sslproto._SSLProtocolTransport object at 0x7f8e7df01580>, <aiohttp.client_proto.ResponseHandler object at 0x7f8e7efb32e0>)
[2020-12-07 18:02:15,565] kopf.engines.peering [WARNING ] Default peering object not found, falling back to the standalone mode.
[2020-12-07 18:02:15,565] kopf.reactor.running [WARNING ] OS signals are ignored: running not in the main thread.
[2020-12-07 18:02:25,339] __main__ [INFO ] Signaling kopf termination...
[2020-12-07 18:02:25,340] kopf.reactor.running [INFO ] Stop-flag is set to True. Operator is stopping.
[2020-12-07 18:02:25,342] kopf.reactor.running [DEBUG ] Root task 'credentials retriever' is cancelled.
[2020-12-07 18:02:25,343] kopf.reactor.running [DEBUG ] Root task 'poster of events' is cancelled.
[2020-12-07 18:02:25,343] kopf.reactor.running [DEBUG ] Root task 'daemon killer' is cancelled.
[2020-12-07 18:02:25,345] kopf.reactor.running [DEBUG ] Root tasks are stopped: finishing normally; tasks left: set()
[2020-12-07 18:02:25,345] kopf.reactor.running [DEBUG ] Hung tasks stopping is skipped: no tasks given.
[2020-12-07 18:02:25,346] asyncio [DEBUG ] Close <_UnixSelectorEventLoop running=False closed=False debug=True>
[2020-12-07 18:02:25,347] __main__ [INFO ] kopf terminated...
Environment
- Kopf version: 0.28.1
- Kubernetes version: 1.17
- Python version: 3.9.1rc1
- OS/platform: linux
Python packages installed, excluding proprietary packages
aioconsole==0.3.1
aioelasticsearch==0.7.0
aiohttp==3.7.3
aiojobs==0.3.0
aiomonitor==0.4.5
amqp==5.0.2
appdirs==1.4.4
APScheduler==3.6.3
arrow==0.17.0
async-timeout==3.0.1
asynctest==0.13.0
asynqp==0.6
attrs==20.3.0
billiard==3.6.3.0
black==20.8b1
bleach==3.2.1
blinker==1.4
cachetools==4.1.1
celery==5.0.2
certifi==2020.11.8
cffi==1.14.4
chardet==3.0.4
click==5.1
click-didyoumean==0.0.3
click-repl==0.1.6
colorama==0.4.4
croniter==0.3.36
cryptography==3.2.1
distro==1.5.0
docker==4.4.0
docutils==0.16
elasticsearch==7.10.0
elasticsearch-dsl==7.3.0
fcache==0.4.7
flake8==3.8.4
flake8-codeclimate==0.3.0
flake8-comprehensions==3.3.0
freezegun==1.0.0
future==0.18.2
gitdb==4.0.5
GitPython==3.1.11
google-auth==1.23.0
httplib2==0.18.1
humanize==3.1.0
idna==2.10
iniconfig==1.1.1
iso8601==0.1.13
isort==5.6.4
janus==0.6.1
jeepney==0.6.0
Jinja2==2.11.2
json-tricks==3.15.5
keyring==21.5.0
kombu==5.0.2
kopf==0.28.1
kubernetes==12.0.1
ldap3==2.8.1
lxml==4.6.2
lz4==3.1.1
MarkupSafe==1.1.1
matrix-client==0.3.2
mccabe==0.6.1
mmh3==2.5.1
multidict==5.1.0
mypy-extensions==0.4.3
natsort==7.1.0
oauth2client==4.1.3
oauthlib==3.1.0
packaging==20.7
pathspec==0.8.1
pip==20.1.1
pkg-resources==0.0.0
pkginfo==1.6.1
pluggy==0.13.1
prometheus-client==0.9.0
prompt-toolkit==3.0.8
psutil==5.7.3
psycopg2-binary==2.8.6
py==1.9.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.6.0
pycparser==2.20
pydantic==1.7.3
pyflakes==2.2.0
Pygments==2.7.2
pykube-ng==20.10.0
pyOpenSSL==20.0.0
pyparsing==2.4.7
pystack-debugger==0.9.0
pytest==6.1.2
pytest-aiohttp==0.3.0
pytest-asyncio==0.14.0
pytest-freezegun==0.4.2
pytest-mock==3.3.1
python-dateutil==2.8.1
python-json-logger==2.0.1
pytz==2020.4
PyYAML==5.3.1
readme-renderer==28.0
regex==2020.11.13
requests==2.25.0
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
rfc3986==1.4.0
rsa==4.6
sd-notify==0.1.0
SecretStorage==3.3.0
sentry-sdk==0.19.4
setproctitle==1.2
setuptools==44.0.0
setuptools-scm==4.1.2
six==1.15.0
smmap==3.0.4
SQLAlchemy==1.3.20
tabulate==0.8.7
termcolor==1.1.0
terminaltables==3.1.0
TestLink-API-Python-client==0.8.1
toml==0.10.2
tqdm==4.54.0
twine==3.2.0
typed-ast==1.4.1
typing-extensions==3.7.4.3
tzlocal==2.1
UnleashClient==3.5.0
urllib3==1.26.2
uvloop==0.14.0
vine==5.0.0
voluptuous==0.12.0
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==0.57.0
wheel==0.36.0
yarl==1.6.3
zstandard==0.14.0
zulip==0.7.1
Thanks for reporting it with a fully reproducible code snippet!
Indeed, it is a bug. And it is reproducible in the main branch too, despite the changes (they do not change how such an error escalates).
The failure mechanics:
As far as I can see, this failure happens this way:
Some asyncio tasks are created before peering.detect() is called. One of them is the "credentials retriever", which populates a credentials vault with credentials from the on-login handlers.
Normally, several more tasks are created afterwards, and an equivalent of asyncio.wait() (specifically, run_tasks()) is used to wait until at least one of them exits/fails. At that moment, all other "root tasks" are also cancelled.
In this case, one of the "root tasks" — the "credentials retriever" — fails and exits. However, the operator is not in its task-waiting state (and that's the trigger!); it is trying to perform a raw API request in peering.detect() — the only API request not wrapped into any tasks.
That API call internally waits for the credentials vault to be populated and signalled with events. However, since the "credentials retriever" is gone by that moment, nothing even tries to populate the vault or to set its events/flags to awaken the authentication wrapper. And so, detect() waits forever for credentials that are never going to appear.
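To illustrate the pattern outside of kopf, here is a minimal, self-contained asyncio sketch of the same deadlock shape (hypothetical names, not kopf's actual code): the task that is supposed to populate the "vault" dies, and the waiter is never woken up.

import asyncio

async def credentials_retriever(vault_ready: asyncio.Event) -> None:
    # Dies before ever setting the event -- like the failed "credentials retriever".
    raise RuntimeError("authentication failed")

async def detect_peering(vault_ready: asyncio.Event) -> None:
    # Like the raw API call in peering.detect(): waits for credentials to appear.
    await vault_ready.wait()

async def main() -> None:
    vault_ready = asyncio.Event()
    retriever = asyncio.create_task(credentials_retriever(vault_ready))
    await asyncio.gather(retriever, return_exceptions=True)  # the retriever is now gone...
    await detect_peering(vault_ready)  # ...so this waits forever; no stop-flag can reach it.

# asyncio.run(main())  # uncommenting this hangs indefinitely -- which is the point.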
If we wanted to fix that, the place to look would probably be kopf.reactor.activities.authenticate() — wrapping the call in a try-except like this:
try:
    activity_results = await run_activity(
        lifecycle=lifecycles.all_at_once,
        registry=registry,
        settings=settings,
        activity=handlers_.Activity.AUTHENTICATION,
    )
except ActivityError:
    await vault.populate({})
    raise
What it does here is signal the vault to wake up shortly, and with it all API calls waiting for the credentials — just to let them see that nothing is going to appear there, and to let them fail.
With this hack applied, the operator exits (it fails with some asyncio errors, so it exits disgracefully, which requires extra attention — but operator() indeed exits).
Plans:
However, it is not worth fixing for one reason:
There is one branch — I've created a preview PR #600 — full of dirty changes, i.e. not finished, though functional (it does run). You can read the documentation drafts in the linked diff to understand what this branch is for.
Among other things, this branch fully reworks how peering detection is done. It does not perform a raw API request from the task-spawning routine anymore. These checks are now performed at runtime, and are toggled on/off multiple times if needed (before: only once at startup time).
For this reason, the issue is not reproducible in that version, as there is no place where it can happen the same way — i.e. by blocking on an API request not wrapped into the "root tasks". The operator exits normally (well, with an expected exception about credentials) with the same code snippet you provided.
This change is the last blocker of the 0.29 milestone. All prerequisites are done. I am now trying to extract unrelated changes as separate PRs, write unit tests, test it manually in certain edge cases, ensure backward compatibility, and generally finish the task. Once finished, it becomes 0.29.0.
If the fix above is applied, it will "lose" its healing effect soon. So I suggest that we do not fix it for now. In your case, you can temporarily avoid the peering detection by disabling peering altogether with --standalone or operator(standalone=True, …).
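Applied to your snippet, that is roughly a one-line change (a sketch; standalone is the operator() argument mentioned above):

# In __setup_kopf_event_loop(): disable peering, and thereby its detection.
kopf_loop.run_until_complete(kopf.operator(
    ready_flag=ready_flag,
    stop_flag=stop_flag,
    standalone=True,
))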
Thanks for the explanation. I agree with your rationale regarding waiting for 0.29, and since we do not need peering yet, there is no problem running standalone. I tried it and it works.
Do you agree that using @kopf.on.cleanup() is the correct way to react to the failed auth on startup? This is what I'm using at the moment for notifying the main app that kopf has died. However, I'd like to have the actual cleanup reason available in the handler as well, which is not provided as of now. Do you have a suggestion how to achieve this? Is there a way to let the main thread know what exactly went wrong? Especially auth problems would be interesting here.
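For reference, a sketch of what I mean; the operatorDied flag is my own construct, not a kopf facility:

import threading
import kopf

# My own flag, watched by the host application's main thread -- not a kopf API.
operatorDied = threading.Event()

@kopf.on.cleanup()
def cleanup_fn(logger, **kwargs):
    # Fires when the operator exits, but without the actual exit reason.
    logger.info("kopf is cleaning up.")
    operatorDied.set()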
@paxbit Semantically, the cleanup happens when the operator exits for any reason, so yes, it is the correct way.
But keep in mind that the cleanup handler is not guaranteed to execute for reasons it cannot control (e.g. SIGKILL or OOM or alike) — as documented. In normal circumstances, it will at least be attempted. If it is not, that is a bug.
Except for this same issue with peering.detect() — because the operator is not finishing its tasks properly, including the startup-cleanup activities. So, on-cleanup will not be called in this case (I checked).
Though, usually, just waiting for a task with operator() exiting should be sufficient to know that the operator has died. Can you give me a hint as to why you need a cleanup activity for that?
PS: Also see #347 — indirectly related to your approach with flags: can be used to pass arbitrary flags/futures/events from outside of the operator directly into its handlers.
Thank you!
Though, usually, just waiting for a task with operator() exiting should be sufficient to know that the operator has died. Can you give me a hint as to why you need a cleanup activity for that?
Well, generally you're right; however, I find it more expressive to have a decorated cleanup function next to the other handlers instead of simply doing stuff after loop.run_until_complete(...). Just for transparency.
Also: since, due to this very bug, the operator() task never completes, I was looking at other options for getting notified about any unclean, self-induced operator exit. Basically I was asking, on a higher level: how can I recycle a pod hosting a crashed-but-non-terminating embedded operator by making the host application's main thread respond to that situation? Maybe I could use a liveness probe (actually not reliably, see below), but that temporally decouples the actual error from its detection and adds one more step to perform when looking for the root cause. Why not reliably? The liveness endpoint is alive and happy even after triggering this bug.
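To make the question concrete, here is the kind of watchdog I had in mind: a sketch using my own names (died_flag is not a kopf API). It relies on operator() returning, which is exactly what this bug prevents:

# Variation of the runner's thread target: signal death on any kind of exit.
def setup_kopf_event_loop(ready_flag: threading.Event,
                          stop_flag: threading.Event,
                          died_flag: threading.Event):
    try:
        kopf_loop = asyncio.new_event_loop()
        asyncio.set_event_loop(kopf_loop)
        with contextlib.closing(kopf_loop):
            kopf_loop.run_until_complete(kopf.operator(
                ready_flag=ready_flag,
                stop_flag=stop_flag,
            ))
    finally:
        died_flag.set()  # reached on normal exit *and* on exceptions

# The main thread could then wait on died_flag and recycle the pod if it fires
# unexpectedly -- but with this very bug, operator() never returns, so the
# finally-block is never reached either.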
I would still like to have a facility to externally hook/listen/catch exceptions raised within kopf when embedding it, at least for the unexpected-exceptions category. And especially for the authentication case: I currently do not understand (i.e. have not looked for or tried it) what happens should authentication expire down the road, mid-lifecycle. Can we detect that, and/or will kopf exit on its own?
PS: Also see #347 — indirectly related to your approach with flags: can be used to pass arbitrary flags/futures/events from outside of the operator directly into its handlers.
Thanks for the tip! However, I may be missing something, as #347 seems to ask questions and collect some "me too" notes. I do not see anything in it yet that satisfies "can be used to", though. Maybe that's still in progress, so I subscribed to it.