(Potential) Memory leak after exception handling and pipeline restarts
During testing, we observed a sudden increase in memory consumption by almost all of our logprep instances.
It turned out that we had configured a wrong TLS certificate for the Opensearch cluster, so the OpensearchOutputConnector instances could not establish connections. This led to numerous FatalOutputError exceptions (and subsequent pipeline restarts):
2024-02-26 09:29:06,360 Logprep Pipeline 1 ERROR : FatalOutputError in OpensearchOutput (opensearch) - Opensearch Output: ['os-cluster.opensearch-prod']: ConnectionError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)) caused by: SSLError([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006))
In our core system, we made a similar observation: the memory consumption of our logprep instances started to increase. Although the growth was not as pronounced as in the test system, some of the pods ran out of memory. We observed this twice, at two different points in time.
A short review revealed that we had network/DNS issues on both occasions. Our pipelines could not reach our Opensearch cluster, which led to a lot of FatalOutputError exceptions and pipeline restarts:
2024-02-25 20:05:44,131 opensearch WARNING : GET https://prod-os-cluster.opensearch:9200/ [status:N/A request:10.008s]
Traceback (most recent call last):
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/util/connection.py", line 72, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/.pex/installed_wheels/e55d5dac054d07afab930a0d5f3de8475381721e9eca3728fbdda611fa0ed070/opensearch_py-2.4.2-py2.py3-none-any.whl/opensearchpy/connection/http_urllib3.py", line 264, in perform_request
response = self.pool.urlopen(
^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 799, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/util/retry.py", line 525, in increment
raise six.reraise(type(error), error, _stacktrace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/packages/six.py", line 770, in reraise
raise value
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 715, in urlopen
httplib_response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 404, in _make_request
self._validate_conn(conn)
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connectionpool.py", line 1058, in _validate_conn
conn.connect()
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connection.py", line 363, in connect
self.sock = conn = self._new_conn()
^^^^^^^^^^^^^^^^
File "/root/.pex/installed_wheels/f0b2b048d0941174a2abe3ab7a6f2b48844192abdba3aaadbe83e78983387f5d/urllib3-1.26.18-py2.py3-none-any.whl/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f246a56ddd0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
2024-02-25 20:05:44,134 Logprep Pipeline 14 ERROR : FatalOutputError in OpensearchOutput (opensearch) - Opensearch Output: ['os-cluster.opensearch']: ConnectionError(<urllib3.connection.HTTPSConnection object at 0x7f246a56ddd0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution) caused by: NewConnectionError(<urllib3.connection.HTTPSConnection object at 0x7f246a56ddd0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution)
It seems that when any of the above exceptions occur and pipelines need to be restarted, something is not completely freed, which leads to the increasing memory usage. However, it is not entirely clear what is causing the memory issues here.
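For anyone trying to narrow this down, a minimal sketch of how the growth across restart cycles could be located with Python's built-in tracemalloc (where exactly the snapshots are taken relative to the restarts is only illustrative, not logprep API):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation

before = tracemalloc.take_snapshot()
# ... let a few FatalOutputError / pipeline restart cycles happen here ...
after = tracemalloc.take_snapshot()

# Show where the most new memory was allocated between the two snapshots.
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)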
Expected behavior
Occurrence of the above exceptions and/or pipeline restarts should not cause logprep to consume more memory.
Environment
Logprep version: 2b16c19cd785b08a74bbba3c384a2eec192e984a
Python version: 3.11
Thank you for this report. In my opinion, we should consider redesigning logprep's failure handling.
Restarting processes in failure cases should not be the responsibility of the application.
Instead, we should exit with an appropriate exit code in all critical failure cases.
Restarting the application should be the task of an init system like systemd or of the container runtime.
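A minimal sketch of that idea, assuming a hypothetical run_pipeline() entry point, a stand-in FatalOutputError class, and an illustrative exit code (none of these are taken from the actual logprep internals):

import logging
import sys

EXIT_FATAL_OUTPUT_ERROR = 4  # illustrative exit code, not defined by logprep


class FatalOutputError(Exception):
    """Stand-in for logprep's FatalOutputError."""


def run_pipeline():
    """Stand-in for logprep's processing loop; fails here for illustration."""
    raise FatalOutputError("could not reach the Opensearch cluster")


def main():
    try:
        run_pipeline()
    except FatalOutputError as error:
        logging.getLogger("logprep").error("fatal output error, exiting: %s", error)
        # Exit instead of restarting the pipeline in-process; systemd
        # (Restart=on-failure) or the container runtime restarts the whole
        # process, so no partially freed state can accumulate across restarts.
        sys.exit(EXIT_FATAL_OUTPUT_ERROR)


if __name__ == "__main__":
    main()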
Possibly the log queue is not closed. This is fixed in logprep 10.0.2.
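For context, a queue left open across pipeline restarts would explain this kind of growth. A generic sketch of closing such a queue on shutdown, assuming it is a multiprocessing.Queue (an assumption about the implementation, not taken from the actual fix):

import multiprocessing
import queue


def close_log_queue(log_queue):
    """Drain and close a multiprocessing queue so its buffer and feeder thread are released."""
    try:
        while True:
            log_queue.get_nowait()  # drop any leftover log records
    except queue.Empty:
        pass
    log_queue.close()        # no more items will be put on this queue
    log_queue.join_thread()  # wait for the background feeder thread to finish


if __name__ == "__main__":
    log_queue = multiprocessing.Queue()
    log_queue.put("example log record")
    close_log_queue(log_queue)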
This should have been solved with the new error handling.