
Operator handler shows connection reset by peer or sometimes just hangs and doesn't respond

Open amolkavitkar opened this issue 4 years ago • 5 comments

Expected Behavior

The pod should respond immediately

Actual Behavior

  • When the operator pod has been running for a long time and we send a create request for a custom resource, it just hangs and doesn't respond; we have to restart the pod, and then it comes back alive

  • Connection reset by peer in the middle of creating resources

Steps to Reproduce the Problem

  1. Create an operator pod and keep it running for a few days
  2. Try to create a custom resource; it doesn't go through

Note: sometimes it gives "connection reset by peer".

Specifications

  • Platform: Ubuntu 16.04
  • Kubernetes version: (use kubectl version) 1.15
  • Python version: (use python --version) 3.5.2
  • Python packages installed: (use pip freeze --all)

connection reset by peer traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/kopf/reactor/handling.py", line 387, in _execute
    lifecycle=lifecycle,  # just a default for the sub-handlers, not used directly.
  File "/usr/local/lib/python3.7/site-packages/kopf/reactor/handling.py", line 478, in _call_handler
    **kwargs,
  File "/usr/local/lib/python3.7/site-packages/kopf/reactor/invocation.py", line 79, in invoke
    result = await loop.run_in_executor(config.WorkersConfig.get_syn_executor(), real_fn)
  File "/usr/local/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/handlers.py", line 159, in create_fn
    obj = v1_client.create_namespaced_service(namespace, SVC_NOSELECTOR_TEMPLATE)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 6934, in create_namespaced_service
    (data) = self.create_namespaced_service_with_http_info(namespace, body, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/core_v1_api.py", line 7025, in create_namespaced_service_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 166, in request
    headers=headers)
  File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 72, in request
    **urlopen_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/request.py", line 150, in request_encode_body
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/poolmanager.py", line 326, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 641, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/retry.py", line 368, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.7/site-packages/urllib3/packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 603, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.7/site-packages/urllib3/connectionpool.py", line 843, in validate_conn
    conn.connect()
  File "/usr/local/lib/python3.7/site-packages/urllib3/connection.py", line 370, in connect
    ssl_context=context)
  File "/usr/local/lib/python3.7/site-packages/urllib3/util/ssl.py", line 368, in ssl_wrap_socket
    return context.wrap_socket(sock)
  File "/usr/local/lib/python3.7/ssl.py", line 423, in wrap_socket
    session=session
  File "/usr/local/lib/python3.7/ssl.py", line 870, in _create
    self.do_handshake()
  File "/usr/local/lib/python3.7/ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

amolkavitkar • Oct 09 '19 09:10

@amolkavitkar Sorry for the delayed answer.

First of all, I see Python 3.5.2 mentioned. Kopf uses some features of Python 3.7 which did not exist earlier (such as contextvars, but also a few more), so, in theory, it cannot run on 3.5.2 (I didn't try, though). Are you sure this is the right version of Python?

Second, can you please provide the Kopf version used, and a code sample with this issue reproduced (with the proprietary and unnecessary code removed)? Without the code, it is unclear what is happening, why, and what can cause the problems. In the stack trace, I see that you call the kubernetes client library with create_namespaced_service(), but this seems like an issue in that client library rather than in Kopf itself. Or, to be more specific, in the low-level SSL/HTTPS sub-libraries used by the kubernetes client library.

Third, it is worth mentioning that the API requests for creating custom resources are, to my knowledge, non-blocking: k8s creates the resource and finishes the API request, and the operators start their handling only after that — i.e., the operators cannot block the resource creation. If it is blocked, it is for some other reason (an assumption).
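
For illustration only, a minimal sketch with hypothetical CRD coordinates (not code from this issue): the create handler fires after the API server has already stored the object, so `kubectl create` itself is not blocked by the operator.

import kopf

# Hypothetical group/version/plural. The handler runs only once the resource
# already exists in the cluster, so a slow or stuck handler delays the
# operator's own reaction, not the creation of the resource itself.
@kopf.on.create('example.com', 'v1', 'examples')
def create_fn(spec, name, namespace, logger, **kwargs):
    logger.info(f"Handling the already-created resource {namespace}/{name}")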

Also, some information on the scale of the operator & cluster would be helpful to reproduce it: how many handled resources do you have (just the order of magnitude: 1s, 10s, 100s, 1000s)? And how often do they change (times per second/minute/hour, also a rough approximation)? High-scale operators can have their own issues with connection- & thread-pools, which have to be taken into account.

I would appreciate it if this information were provided here in the issue, so that I can reproduce it or guess what happens internally.


One of the mentioned effects, or the one that looks like it — operators not exiting properly — has been known for some time, is well described in #152, and is going to be fixed by using asyncio/aiohttp for all HTTPS operations (drafted in #176). This happens with any sync HTTPS client library inside an async app, so there is no good way to solve it now. The proper aiohttp way requires a few hops to implement prerequisite changes as separate PRs (in focus right now). Currently, it is "kind-of-solved" by Kubernetes using SIGKILL on the pod after it does not exit properly on SIGTERM.

nolar • Oct 15 '19 16:10

We've also been having the same problem with our deployments, using kopf v0.20, although we are not experiencing any exceptions or dropped connections. In our case the pod simply stops receiving updates from the kubernetes API: nothing is sent from the kubernetes API when a custom resource is created or updated, and logs are no longer emitted.

Upon further investigation, we've discovered that the connection between the kubernetes API and the operator is still in the 'ESTABLISHED' state, and there are no clear issues in the logs which would suggest that the API and operator can no longer communicate. However, there is no longer any data being exchanged on the connection, as the octets sent and received by the kopf process remain steady (according to /proc/1/net/netstat).

I'd like to get you some sample code so you can investigate this further. However, since we need this working ASAP, I've developed a liveness check based on my investigations, which verifies that some kind of communication is occurring between the API and the operator:

liveness-probe.sh -- To be added to the operator container

#!/bin/sh

# Files storing previous data values for comparison.
# You may want to change these files depending on the frequency of /tmp deletion
# and the frequency of liveness probing.
LAST_DATA_FILE=/tmp/last_data
LAST_TIME_FILE=/tmp/last_time

# Process ID of kopf within the container PID namespace (probably 1)
PID=1

get_data() {
        CURRENT_DATA=`cat /proc/$PID/net/netstat | grep 'IpExt' | tail -n 1 | cut -d ' ' -f8`
}

record_data() {
        echo $CURRENT_DATA > $LAST_DATA_FILE
}

record_time() {
        date > $LAST_TIME_FILE
}

get_data

if [ ! -e $LAST_DATA_FILE ]; then
        record_data
        record_time
        exit 0
fi

LAST_DATA=`cat $LAST_DATA_FILE`
LAST_TIME=`cat $LAST_TIME_FILE`

# Provide some output for display in pod events
echo "previous = $LAST_DATA, current = $CURRENT_DATA, last received time = $LAST_TIME"

if [ ! $CURRENT_DATA -gt $LAST_DATA ]; then
        # Implies the counter is equal; not strictly necessary to record data...
        record_data
        exit 1
fi

record_data
record_time

# Exit explicitly
exit 0

Snippet of the liveness probe configuration:

        livenessProbe:
          exec:
            command:
              - /bin/sh
              - /srv/liveness-probe.sh
          initialDelaySeconds: 60
          periodSeconds: 60
          failureThreshold: 4
          successThreshold: 1
          timeoutSeconds: 1

The code works well with the python:3.7 docker image which we're using. If you're not using that, you'll need to make sure that at least coreutils and grep are installed.

These liveness probe values also work well for us, but they can produce 'unhealthy' events on the operator pod, since the probe period is shorter than the typical interval between operator/API exchanges that I've observed. The period can be bumped to 4 minutes and the failure threshold reduced to 1 to clean up your events as well.

dozzman • Oct 22 '19 02:10

I also observed this strange behaviour once (accidentally) — when the operator didn't see the custom resource created. Surprisingly, it saw other resources and reacted to them. But I couldn't catch and reproduce this situation after the restart.

Meanwhile, there is a new pre-release: kopf==0.23rc1 (0.23 will be released in a couple of days if the live testing goes well).

Among many other changes, this version fully switches Kopf to asyncio/aiohttp. Previously, there were synchronous pykube-ng -> requests -> urllib3 -> socket calls, which are now completely gone; they already caused issues with sync socket reads being stuck at exit (see #152).

It is highly probable that the issue will not reappear, since the whole I/O machinery has changed. I'll be looking for this case more carefully. Please do so too, and report if it does reappear. I will keep this issue open for a few weeks, and then close it if everything goes smoothly.


Also —though not directly related to the reported issue— there is now a liveness endpoint available via kopf run --liveness http://:8080/healthz, with optional @kopf.on.probe() handlers to collect the data reported in the response to the liveness probe. See the health-checks documentation. Maybe that will be useful for you.
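
A rough sketch, following the health-checks documentation (the handler ids and returned values here are arbitrary examples, not required names):

import datetime
import random

import kopf

# The values returned by the probe handlers are collected and included in the
# response of the endpoint enabled by:
#   kopf run --liveness=http://0.0.0.0:8080/healthz handlers.py
@kopf.on.probe(id='now')
def get_current_timestamp(**kwargs):
    return datetime.datetime.utcnow().isoformat()

@kopf.on.probe(id='random')
def get_random_value(**kwargs):
    return random.randint(0, 1_000_000)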

nolar • Nov 13 '19 17:11

@nolar We still experience something very similar with the latest kopf release (minus the "connection reset by peer" part): after leaving the controller running for some time (like overnight), it stops handling changes to the resources it watches - there is complete silence in the handler logs, and the kopf annotation on the resources is not updated. We've enabled liveness probes too, and they constantly pass (per the logs, and there are no container restarts), but resource handling simply does not happen; the controller seems to not receive updates from K8s. Only a controller restart helps.

  • Platform: Ubuntu 18.04
  • Kubernetes version: 1.14 / Docker Enterprise
  • Python version: 3.7.5
  • Python packages (relevant ones): kopf 0.25, aiohttp 3.6.2

After some netstat-ing and tcpdump-ing, it seems that the TCP connection is there, but the HTTP one is gone, and the controller does not notice. This may be a manifestation of https://github.com/kubernetes/kubernetes/issues/42552

The current suspect is the timeout=aiohttp.ClientTimeout(total=None) argument passed to the GET ..?watch=true request here: https://github.com/zalando-incubator/kopf/blob/51a3a70e09a17cf3baec2946b64b125a90595cf4/kopf/clients/watching.py#L204 While I understand why that may be preferable for such watch requests, it seems that (bad?) networking conditions/blips can cause such hang-ups when no strict timeout is set. Our preliminary testing shows that setting a specific timeout in this place does not disturb normal controller operation, but in case of a lost connection the controller crashes and is restarted by K8s, which usually alleviates the problem.

I'll push a PR to make this timeout configurable (with the current behaviour preserved as the default).
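
For context, a rough sketch of the idea (not kopf's actual code; the URL and the 300-second value are illustrative assumptions):

import asyncio
import aiohttp

# Illustrative watch URL; the real one is built by kopf's watching machinery.
WATCH_URL = "https://kubernetes.default.svc/apis/example.com/v1/examples?watch=true"

async def watch_with_read_timeout():
    # total=None keeps the long-lived watch open indefinitely, while sock_read
    # bounds how long we wait for *any* data on the socket. If the connection
    # silently dies, the read times out and the caller can reconnect (or crash
    # and let Kubernetes restart the pod) instead of hanging forever.
    timeout = aiohttp.ClientTimeout(total=None, sock_read=300)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        try:
            async with session.get(WATCH_URL) as response:
                async for line in response.content:
                    print(line)  # one watch event (JSON) per line
        except asyncio.TimeoutError:
            print("No data for 300 seconds; reconnecting the watch")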

pshchelo • Mar 06 '20 14:03

I too saw a connection reset by peer today, which caused the thread to hang. Here is the output:

Root task 'poster of events' is failed: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/kopf/reactor/running.py", line 453, in _root_task_checker
    await coro
  File "/usr/local/lib/python3.8/dist-packages/kopf/engines/posting.py", line 165, in poster
    await events.post_event(
  File "/usr/local/lib/python3.8/dist-packages/kopf/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/usr/local/lib/python3.8/dist-packages/kopf/clients/events.py", line 77, in post_event
    response = await context.session.post(
  File "/usr/local/lib/python3.8/dist-packages/aiohttp/client.py", line 504, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.8/dist-packages/aiohttp/client_reqrep.py", line 847, in start
    message, payload = await self._protocol.read()  # type: ignore  # noqa
  File "/usr/local/lib/python3.8/dist-packages/aiohttp/streams.py", line 591, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
Root task 'daemon killer' is finished unexpectedly.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nhd/TriadController.py", line 134, in TriadControllerLoop
    loop.run_until_complete(kopf.operator(
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.8/dist-packages/kopf/reactor/running.py", line 161, in operator
    await run_tasks(operator_tasks, ignored=existing_tasks)
  File "/usr/local/lib/python3.8/dist-packages/kopf/reactor/running.py", line 370, in run_tasks
    await _reraise(root_done | root_cancelled | hung_done | hung_cancelled)
  File "/usr/local/lib/python3.8/dist-packages/kopf/reactor/running.py", line 437, in _reraise
    task.result()  # can raise the regular (non-cancellation) exceptions.
  File "/usr/local/lib/python3.8/dist-packages/kopf/reactor/running.py", line 453, in _root_task_checker
    await coro
  File "/usr/local/lib/python3.8/dist-packages/kopf/engines/posting.py", line 165, in poster
    await events.post_event(
  File "/usr/local/lib/python3.8/dist-packages/kopf/clients/auth.py", line 45, in wrapper
    return await fn(*args, **kwargs, context=context)
  File "/usr/local/lib/python3.8/dist-packages/kopf/clients/events.py", line 77, in post_event
    response = await context.session.post(
  File "/usr/local/lib/python3.8/dist-packages/aiohttp/client.py", line 504, in _request
    await resp.start(conn)
  File "/usr/local/lib/python3.8/dist-packages/aiohttp/client_reqrep.py", line 847, in start
    message, payload = await self._protocol.read()  # type: ignore  # noqa
  File "/usr/local/lib/python3.8/dist-packages/aiohttp/streams.py", line 591, in read
    await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer

cliffburdick • Jun 17 '20 23:06