(Potential) Memory leak after exception handling and pipeline restarts
During testing, we observed a sudden increase in memory consumption by almost all of our logprep instances.
It turned out that we had configured a wrong TLS certificate for the Opensearch cluster, so the OpensearchOutputConnector instances could not establish connections. This led to numerous FatalOutputError exceptions (and subsequent pipeline restarts):
2024-02-26 09:29:06,360 Logprep Pipeline 1 ERROR : FatalOutputError in OpensearchOutput (opensearch) - Opensearch Output: ['os-cluster.opensearch-prod']: ConnectionError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)) caused by: SSLError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006))
In our core system, we made a similar observation: the memory consumption of our logprep instances started to increase. Although the growth was not as pronounced as in the test system, some of the pods ran out of memory. We observed this twice, at two different points in time.
A short review revealed that we had network/DNS issues on both occasions. Our pipelines could not reach our Opensearch cluster, which led to a lot of FatalOutputError exceptions and pipeline restarts:
2024-02-25 20:05:44,131 opensearch WARNING : GET https://prod-os-cluster.opensearch:9200/ [status:N/A request:10.008s]
Traceback (most recent call last):
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/util/connection.py", line 72, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.pex/installed_wheels/e55d5dac054d07afab930a0d5f3de8475381721e9eca3728fbdda611fa0ed070/opensearch_py-2.4.2-py2.py3-none-any.whl/opensearchpy/connection/http_urllib3.py", line 264, in perform_request
response = self.pool.urlopen(
^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 799, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/util/retry.py", line 525, in increment
raise six.reraise(type(error), error, _stacktrace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/packages/six.py", line 770, in reraise
raise value
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 715, in urlopen
httplib_response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 404, in _make_request
self._validate_conn(conn)
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 1058, in _validate_conn
conn.connect()
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connection.py", line 363, in connect
self.sock = conn = self._new_conn()
^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f246a56ddd0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
2024-02-25 20:05:44,134 Logprep Pipeline 14 ERROR : FatalOutputError in OpensearchOutput (opensearch) - Opensearch Output: ['os-cluster.opensearch']: ConnectionError(<urllib3.connection.HTTPSConnection object at 0x7f246a56ddd0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution) caused by: NewConnectionError(<urllib3.connection.HTTPSConnection object at 0x7f246a56ddd0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution)
It seems that when any of the above exceptions occur and pipelines need to be restarted, something is not completely freed, which leads to the increasing memory usage. However, it is not entirely clear what is causing the memory issues here.
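For anyone trying to narrow this down, a minimal sketch of how the growth across restart cycles could be located with Python's built-in tracemalloc (where exactly the snapshots are taken relative to the restarts is only illustrative, not logprep API):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation

before = tracemalloc.take_snapshot()
# ... let a few FatalOutputError / pipeline restart cycles happen here ...
after = tracemalloc.take_snapshot()

# Show where the most new memory was allocated between the two snapshots.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)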
Expected behavior
Occurrence of the above exceptions and/or pipeline restarts should not cause logprep to consume more memory.
Environment
Logprep version: 2b16c19cd785b08a74bbba3c384a2eec192e984a
Python version: 3.11
Thank you for this report. In my opinion, we should consider redesigning logprep's failure handling.
Restarting processes in failure cases should not be the responsibility of the application.
Instead, we should exit with an appropriate exit code in all critical failure cases.
Restarting the application should be the task of an init system like systemd or of the container runtime.
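A minimal sketch of that idea, assuming a hypothetical run_pipeline() entry point, a stand-in FatalOutputError class, and an illustrative exit code (none of these are taken from the actual logprep internals):

import logging
import sys

EXIT_FATAL_OUTPUT_ERROR = 4  # illustrative exit code, not defined by logprep


class FatalOutputError(Exception):
    """Stand-in for logprep's FatalOutputError."""


def run_pipeline():
    """Stand-in for logprep's processing loop; fails here for illustration."""
    raise FatalOutputError("could not reach the Opensearch cluster")


def main():
    try:
        run_pipeline()
    except FatalOutputError as error:
        logging.getLogger("logprep").error("fatal output error, exiting: %s", error)
        # Exit instead of restarting the pipeline in-process; systemd
        # (Restart=on-failure) or the container runtime restarts the whole
        # process, so no partially freed state can accumulate across restarts.
        sys.exit(EXIT_FATAL_OUTPUT_ERROR)


if __name__ == "__main__":
    main()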
Possibly the log queue is not closed. This is fixed in logprep 10.0.2.
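For context, a queue left open across pipeline restarts would explain this kind of growth. A generic sketch of closing such a queue on shutdown, assuming it is a multiprocessing.Queue (an assumption about the implementation, not taken from the actual fix):

import multiprocessing
import queue


def close_log_queue(log_queue):
    """Drain and close a multiprocessing queue so its buffer and feeder thread are released."""
    try:
        while True:
            log_queue.get_nowait()  # drop any leftover log records
    except queue.Empty:
        pass
    log_queue.close()        # no more items will be put on this queue
    log_queue.join_thread()  # wait for the background feeder thread to finish


if __name__ == "__main__":
    log_queue = multiprocessing.Queue()
    log_queue.put("example log record")
    close_log_queue(log_queue)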
This should have been solved with the new error handling.