Logprep
Uncaught `UnicodeDecodeError` exceptions causing pipeline crashes
Recently, we have encountered pipeline crashes caused by uncaught UnicodeDecodeError exceptions when the ConfluentKafkaInputConnector tries to decode raw events:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/venv/lib/python3.12/site-packages/logprep/framework/pipeline.py", line 223, in run
    self.process_pipeline()
  File "/opt/venv/lib/python3.12/site-packages/logprep/framework/pipeline.py", line 233, in process_pipeline
    event = self._get_event()
            ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/logprep/framework/pipeline.py", line 259, in _get_event
    event, non_critical_error_msg = self._input.get_next(self._timeout)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/logprep/metrics/metrics.py", line 218, in inner
    result = func(self, *args, **kwargs)
  File "/opt/venv/lib/python3.12/site-packages/logprep/abc/input.py", line 280, in get_next
    event, raw_event = self._get_event(timeout)
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/logprep/connector/confluent_kafka/input.py", line 418, in _get_event
    event_dict = self._decoder.decode(raw_event)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 197: invalid start byte
One of our sources provides log messages that appear to contain Latin-1 encoded character sequences, such as the byte 0xfc ("ü") in "Küche":
b'{"@timestamp":"2024-07-22T12:58:21+02:00", "fromhost-ip":"192.168.178.2", "hostname":"fancy_host", "message":"Driver HP Universal Printing PCL 6 (v7.0.1) required for printer fancy_printer (Color K\xfcche) is unknown. Contact the administrator to install the driver before you log in again.", "tags":["syslog"]}\n'
Expected behavior
I would expect UnicodeDecodeError exceptions raised by msgspec.json.Decoder.decode() to be caught as well, so that a single wrongly encoded event does not crash the pipeline.
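To illustrate the expected behavior, here is a minimal sketch of a guarded decode step. It is not Logprep's actual code: `decode_event_safely` is a hypothetical helper, and the standard library `json` module stands in for `msgspec.json.Decoder`, on the assumption that both raise `UnicodeDecodeError` for invalid UTF-8 input (as the traceback above shows for msgspec).

```python
import json
from typing import Optional, Tuple


def decode_event_safely(raw_event: bytes) -> Tuple[Optional[dict], Optional[str]]:
    """Decode a raw JSON event; return (event, None) or (None, error message).

    Hypothetical helper for illustration only. It treats encoding errors
    the same way as JSON syntax errors instead of letting them propagate.
    """
    try:
        event = json.loads(raw_event)
    except (UnicodeDecodeError, json.JSONDecodeError) as error:
        # Report the failure to the caller instead of crashing the process.
        return None, f"{type(error).__name__}: {error}"
    return event, None


# A valid UTF-8 event decodes normally.
good_event, good_error = decode_event_safely(b'{"message":"ok"}')

# A Latin-1 encoded "ü" (0xfc) is reported as an error, not raised.
bad_event, bad_error = decode_event_safely(b'{"message":"Color K\xfcche"}')
print(bad_error)
```

With handling along these lines, the caller could route the raw event to an error output or log it as a non-critical error, whichever fits the connector's existing error handling.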
Current behavior
Malformed or wrongly encoded raw events cause uncaught UnicodeDecodeError exceptions, which crash the pipeline.
Steps to reproduce
Process raw events that contain Latin-1 encoded character sequences.
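The underlying decoding failure can be reproduced without Kafka or Logprep. The sketch below uses a shortened, hypothetical stand-in for the raw event above and plain `bytes.decode`; the helper name `utf8_error` is made up for this example.

```python
from typing import Optional

# Shortened stand-in for the raw event above; \xfc is "ü" in Latin-1.
raw_event = b'{"hostname":"fancy_host", "message":"fancy_printer (Color K\xfcche)"}\n'


def utf8_error(data: bytes) -> Optional[str]:
    """Return the decoding error message, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return str(error)
    return None


# Decoding as UTF-8 fails on the 0xfc byte.
error_message = utf8_error(raw_event)
print(error_message)

# Decoding the same bytes as Latin-1 succeeds and yields "Küche".
print(raw_event.decode("latin-1"))
```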
Environment
Logprep version: 11.0.1, but also tested with latest release
Python version: 3.12.4