wazuh icon indicating copy to clipboard operation
wazuh copied to clipboard

Implemented valid UTF8 character checks

Open zbalkan opened this issue 1 year ago • 7 comments

Related issue
https://github.com/wazuh/wazuh/issues/25967

Description

This is a continuation of issue https://github.com/wazuh/wazuh/issues/23354, about the fix PR https://github.com/wazuh/wazuh/pull/23543.

This PR addresses an issue with the UTF-8 validation logic in the agent where valid UTF-8 multibyte characters were mistakenly being identified as invalid. The original implementation performed overly restrictive checks on sequences of bytes representing characters like Ü, ü, Õ, õ, Ö, ö, Ä, ä, Ş, ş, Ç, ç, causing the File Integrity Monitoring (FIM) module to incorrectly ignore file paths containing these characters.

Problem

The original validation logic checked for valid UTF-8 sequences but incorrectly marked certain valid multibyte characters as invalid due to overly restrictive rules on the leading byte of 2-, 3-, and 4-byte sequences. As a result, characters that are fully compliant with the UTF-8 standard were ignored, causing the FIM module to overlook legitimate file paths containing these characters. This led to unintended behavior in path validation and monitoring.

Also, it is seen that the python test cases to check invalid UTF-8 characters are incorrect as well. The characters used as invalid tests are actually valid characters.

Solution

The macros for validating UTF-8 sequences have been updated to properly handle all valid UTF-8 byte ranges:

  • valid_2: Now properly validates 2-byte sequences, ensuring no overlong encodings occur and that valid 2-byte sequences are recognized.
  • valid_3: Correctly handles special cases where the leading byte is 0xE0 or 0xED. Overlong encodings starting with 0xE0 are excluded, and surrogate halves (reserved for UTF-16) starting with 0xED are correctly identified as invalid.
  • valid_4: Properly validates 4-byte sequences, ensuring sequences that start with 0xF0 are not overlong and that sequences do not exceed the Unicode limit (U+10FFFF).

With these fixes, the validation logic correctly identifies all valid UTF-8 sequences, including multibyte characters commonly used in various languages.

Configuration options

Logs/Alerts example

Tests

  • Compilation without warnings in every supported platform
    • [x] Linux
    • [ ] Windows
    • [ ] MAC OS X
  • [ ] Source installation
  • [ ] Package installation
  • [ ] Source upgrade
  • [ ] Package upgrade
  • [ ] Review logs syntax and correct language
  • [ ] QA templates contemplate the added capabilities
  • Memory tests for Linux
    • [ ] Scan-build report
    • [ ] Coverity
    • [ ] Valgrind (memcheck and descriptor leaks check)
    • [ ] Dr. Memory
    • [ ] AddressSanitizer
  • Memory tests for Windows
    • [ ] Scan-build report
    • [ ] Coverity
    • [ ] Dr. Memory
  • Memory tests for macOS
    • [ ] Scan-build report
    • [ ] Leaks
    • [ ] AddressSanitizer
  • [ ] Retrocompatibility with older Wazuh versions
  • [ ] Working on cluster environments
  • [ ] Configuration on demand reports new parameters
  • [ ] The data flow works as expected (agent-manager-api-app)
  • [x] Added unit tests (for new features)
  • [ ] Stress test for affected components
  • Decoder/Rule tests
    • [ ] Added unit testing files ".ini"
    • [ ] runtests.py executed without errors

zbalkan avatar Oct 11 '24 15:10 zbalkan