
Uncaught `UnicodeDecodeError` exceptions causing pipeline crashes

Open clumsy9 opened this issue 1 year ago • 0 comments

Recently, we have encountered pipeline crashes caused by uncaught UnicodeDecodeError exceptions when the ConfluentKafkaInputConnector tries to decode raw events:

Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/venv/lib/python3.12/site-packages/logprep/framework/pipeline.py", line 223, in run
    self.process_pipeline()
  File "/opt/venv/lib/python3.12/site-packages/logprep/framework/pipeline.py", line 233, in process_pipeline
    event = self._get_event()
            ^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/logprep/framework/pipeline.py", line 259, in _get_event
    event, non_critical_error_msg = self._input.get_next(self._timeout)
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/logprep/metrics/metrics.py", line 218, in inner
    result = func(self, *args, **kwargs)

  File "/opt/venv/lib/python3.12/site-packages/logprep/abc/input.py", line 280, in get_next
    event, raw_event = self._get_event(timeout)
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/logprep/connector/confluent_kafka/input.py", line 418, in _get_event
    event_dict = self._decoder.decode(raw_event)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 197: invalid start byte

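One of our sources somehow managed to provide us with log messages that seem to contain Latin-1 encoded character sequences. In the example below, the byte 0xfc is "ü" in Latin-1 ("Küche"), but it is not a valid start byte in UTF-8: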

b'{"@timestamp":"2024-07-22T12:58:21+02:00", "fromhost-ip":"192.168.178.2", "hostname":"fancy_host", "message":"Driver HP Universal Printing PCL 6 (v7.0.1) required for printer fancy_printer (Color K\xfcche) is unknown. Contact the administrator to install the driver before you log in again.", "tags":["syslog"]}\n'
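The failure can be shown without Logprep or Kafka. The following is a minimal sketch using only the Python standard library; `raw_event` is a shortened, illustrative version of the message above, not the original payload. It shows that the bytes are rejected as UTF-8 but decode cleanly as Latin-1:

```python
import json

# Shortened, illustrative version of the raw event; \xfc is "ü" in Latin-1.
raw_event = b'{"message":"printer fancy_printer (Color K\xfcche) is unknown"}\n'

# Decoding as UTF-8 fails, because 0xfc is not a valid UTF-8 start byte.
try:
    raw_event.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as error:
    utf8_error = error

# Latin-1 maps every byte to a code point, so this decode cannot fail.
event = json.loads(raw_event.decode("latin-1"))
```

Here `utf8_error.reason` carries the same "invalid start byte" text as the traceback, and `event["message"]` contains "Küche".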

Expected behavior

I would expect UnicodeDecodeError exceptions raised by msgspec.json.Decoder.decode() to be caught as well, so that a single wrongly encoded event does not crash the pipeline.
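For illustration, here is a minimal sketch of that kind of handling. It is not Logprep's actual code: `decode_event` is a hypothetical helper, and the standard library's `json.loads` stands in for the msgspec decoder so the example runs without extra dependencies. The idea is to turn a decoding failure into an error message that the caller can log or route elsewhere, instead of letting the exception propagate:

```python
import json
from typing import Any, Optional, Tuple


def decode_event(raw_event: bytes) -> Tuple[Optional[Any], Optional[str]]:
    """Hypothetical helper: return (event, error_message) instead of raising."""
    try:
        # json.loads decodes bytes as UTF-8 here and raises UnicodeDecodeError
        # on invalid byte sequences, and JSONDecodeError on malformed JSON.
        return json.loads(raw_event), None
    except (UnicodeDecodeError, json.JSONDecodeError) as error:
        return None, f"could not decode raw event: {error}"


# A Latin-1 encoded "ü" (0xfc) no longer raises; it yields an error message.
event, error_message = decode_event(b'{"message": "Color K\xfcche"}')
```

Whether such events should be dropped, forwarded to an error output, or re-decoded with a fallback encoding is a separate design decision that this sketch leaves open.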

Current behavior

Malformed or wrongly encoded raw events cause uncaught UnicodeDecodeError exceptions, which crash the pipeline.

Steps to reproduce

Presumably: process raw events that contain Latin-1 encoded byte sequences, such as the \xfc in the example above.

Environment

Logprep version: 11.0.1, but also tested with latest release

Python version: 3.12.4

clumsy9, Jul 23 '24 14:07