M5GFX

textLength function returns invalid UTF-8 truncation positions for multi-byte characters

Open mengxiyou opened this issue 7 months ago • 0 comments

Issue Description

The textLength function may return a length value that splits multi-byte UTF-8 characters, resulting in invalid UTF-8 sequences when truncating strings.

Steps to Reproduce

  1. Input: A UTF-8 string containing multi-byte characters (e.g., a 3-byte character like U+090B: "\xE0\xA4\x8B").

  2. Call: Invoke textLength with a width parameter that forces truncation within the multi-byte character’s byte sequence.

  3. Result: The returned length points to the middle of the character, producing a malformed UTF-8 substring.

Example:

const char* text = "\xE0\xA4\x8B"; // Valid 3-byte UTF-8 character  
int32_t maxWidth = ...; // Width that truncates mid-character  
int32_t len = display.textLength(text, maxWidth);  

// Truncated string: text[0..len-1] may be "\xE0" or "\xE0\xA4", both invalid.
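To check whether a given prefix length really cuts a character in half, a small structural check along these lines could be used. This is only a sketch: `utf8_complete` is a hypothetical helper, not part of M5GFX, and it only verifies lead/continuation byte structure (it does not reject overlong encodings or surrogates).

```cpp
#include <cstddef>

// Hypothetical helper (not an M5GFX API): returns true if the first `len`
// bytes of `s` consist only of complete UTF-8 sequences.
// Structural check only: overlong forms and surrogates are not rejected.
inline bool utf8_complete(const char* s, std::size_t len) {
  std::size_t i = 0;
  while (i < len) {
    unsigned char b = static_cast<unsigned char>(s[i]);
    std::size_t n;
    if (b < 0x80)                n = 1;  // ASCII
    else if ((b & 0xE0) == 0xC0) n = 2;  // 110xxxxx
    else if ((b & 0xF0) == 0xE0) n = 3;  // 1110xxxx
    else if ((b & 0xF8) == 0xF0) n = 4;  // 11110xxx
    else return false;                   // stray continuation or invalid lead byte
    if (i + n > len) return false;       // sequence is cut off by the prefix length
    for (std::size_t k = 1; k < n; ++k) {
      // every following byte must look like 10xxxxxx
      if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
    }
    i += n;
  }
  return true;
}
```

With the example string above, prefix lengths 1 and 2 fail this check and only length 3 (or 0) passes.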

Current Behavior

The function walks the UTF-8 string one byte at a time and advances the pointer after every byte, including the bytes in the middle of a multi-byte character. If the width limit is reached mid-character, the returned length is not rewound to the start of the incomplete character.

Expected Behavior

The function should return lengths that align with complete UTF-8 character boundaries, ensuring truncated strings are valid UTF-8.
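Until this is addressed in the library, a caller-side workaround might look like the sketch below. `utf8_floor_boundary` is a hypothetical helper, not part of M5GFX. It assumes the input is valid UTF-8 and that `total` is the byte length of the whole string (without the terminating NUL).

```cpp
#include <cstddef>

// Hypothetical helper (not an M5GFX API): rounds a byte length down to the
// nearest UTF-8 character boundary. Assumes `s` holds valid UTF-8 and
// `total` is its length in bytes.
inline std::size_t utf8_floor_boundary(const char* s, std::size_t total,
                                       std::size_t len) {
  if (len >= total) return total;  // nothing is cut off
  // s[len] is the first byte that would be dropped. If it is a continuation
  // byte (10xxxxxx), the cut falls inside a character, so step back.
  while (len > 0 && (static_cast<unsigned char>(s[len]) & 0xC0) == 0x80) {
    --len;
  }
  return len;
}
```

The value returned by textLength could be passed through this helper before the string is truncated, so that the prefix always ends on a character boundary.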

Root Cause

  • Byte-wise Pointer Increment: The loop increments the pointer (++string) for each byte processed, regardless of whether those bytes belong to a multi-byte UTF-8 character.

  • No Rewind on Truncation: When truncating mid-character, the function returns the current byte offset instead of rewinding to the start of the incomplete character.
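One possible shape for a fix is to advance by whole characters and only ever report offsets that sit on a character boundary. The sketch below is not the M5GFX implementation: `fit_length` and `utf8_seq_len` are hypothetical names, and a fixed `charWidth` per character stands in for the real per-glyph width lookup.

```cpp
#include <cstddef>
#include <cstdint>

// Length in bytes of the sequence introduced by lead byte `b`.
// Malformed lead bytes count as 1 so the loop below always advances.
inline std::size_t utf8_seq_len(unsigned char b) {
  if ((b & 0x80) == 0x00) return 1;
  if ((b & 0xE0) == 0xC0) return 2;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF8) == 0xF0) return 4;
  return 1;
}

// Hypothetical sketch: returns how many bytes of `text` fit into `maxWidth`.
// ASSUMPTION: every character is `charWidth` pixels wide; a real version
// would query the font for each decoded code point instead.
inline std::int32_t fit_length(const char* text, std::int32_t maxWidth,
                               std::int32_t charWidth) {
  std::int32_t used = 0;
  std::size_t pos = 0;  // always points at the start of a character
  while (text[pos] != '\0') {
    std::size_t n = utf8_seq_len(static_cast<unsigned char>(text[pos]));
    // Stop at the boundary if the sequence is truncated or malformed.
    for (std::size_t i = 1; i < n; ++i) {
      if ((static_cast<unsigned char>(text[pos + i]) & 0xC0) != 0x80) {
        return static_cast<std::int32_t>(pos);
      }
    }
    if (used + charWidth > maxWidth) break;  // next character does not fit
    used += charWidth;
    pos += n;  // advance by the whole character, never by a single byte
  }
  return static_cast<std::int32_t>(pos);
}
```

Because `pos` only moves in steps of a full sequence, the returned length cannot land inside a multi-byte character, and no rewind step is needed.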

This issue impacts any use case where textLength is used to safely truncate UTF-8 strings (e.g., text rendering, serialization).

mengxiyou, May 24 '25 15:05