webpp icon indicating copy to clipboard operation
webpp copied to clipboard

Unicode: fix decompose_iterator

Open the-moisrex opened this issue 10 months ago • 5 comments

I think there's an off-by-one bug in it.

the-moisrex avatar May 26 '25 07:05 the-moisrex

I probably should check operator-- as well.

the-moisrex avatar May 28 '25 09:05 the-moisrex

I think the new tests' problems are about the operator-- implementaiton.

the-moisrex avatar May 30 '25 15:05 the-moisrex

/webpp/tests/./unicode_fuzz.hpp:199: Failure
Value of: isNFC(res16.begin(), res16.end())
  Actual: false
Expected: true
Src: \x26\x03\x00\xB2
Src16: \x0326\xB200
NFC: \x0326\xB200

/webpp/tests/./unicode_fuzz.hpp:199: Failure
Value of: isNFC(res16.begin(), res16.end())
  Actual: false
Expected: true
Src: \x0A\x01\x01\x01\x01\x03\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01
Src16: \x010A\x0101\x0301\x0101\x0101\x0101\x0101\x0101\x0101\x0101
NFC: \x010A\x0101\x0301\x0101\x0101\x0101\x0101\x0101\x0101\x0101

/webpp/tests/./unicode_fuzz.hpp:199: Failure
Value of: isNFC(res16.begin(), res16.end())
  Actual: false
Expected: true
Src: \x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x03\x0A\x01
Src16: \x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x010A
NFC: \x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x0303\x010A

/webpp/tests/./unicode_fuzz.hpp:199: Failure
Value of: isNFC(res16.begin(), res16.end())
  Actual: false
Expected: true
Src: \x0A\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x01\x03\x01\x01\x01\x01\x01\x01\x01\x01
Src16: \x010A\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0301\x0101\x0101\x0101\x0101
NFC: \x010A\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0101\x0301\x0101\x0101\x0101\x0101

/webpp/tests/./unicode_fuzz.hpp:199: Failure
Value of: isNFC(res16.begin(), res16.end())
  Actual: false
Expected: true
Src: \x8A\x0A\x03\x03\x03\x03\x03\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1\xC1
Src16: \x0A8A\x0303\x0303\xC103\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1
NFC: \x0A8A\x0303\x0303\xC103\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1\xC1C1

Seems like UTF-16 is being blown up now.

the-moisrex avatar May 30 '25 16:05 the-moisrex

No, it's not UTF-16 only:

/webpp/tests/./unicode_fuzz.hpp:198: Failure
Value of: isNFC(res.begin(), res.end())
  Actual: false
Expected: true
Src: \xCC\x8A\xCC
NFC: \xCC\x8A\xC3\x8C

/webpp/tests/./unicode_fuzz.hpp:198: Failure
Value of: isNFC(res.begin(), res.end())
  Actual: false
Expected: true
Src: \xCC\x82\xCC
NFC: \xCC\x82\xC3\x8C

/webpp/tests/./unicode_fuzz.hpp:198: Failure
Value of: isNFC(res.begin(), res.end())
  Actual: false
Expected: true
Src: \xCD\x84\xCD
NFC: \xCC\x88\xCC\x81\xC3\x8D


the-moisrex avatar May 31 '25 03:05 the-moisrex

Now we at this:

/webpp/tests/./unicode_fuzz.hpp:198: Failure
Value of: isNFC(res.begin(), res.end())
  Actual: false
Expected: true
Src: \xF0\xCD\x81\xCC
NFC: \xF0\xCC\x81\xC3\x8C
decomposed: \xF0\xCC\x81\x49\xCC\x80
NFD: \xF0\xCC\x81\x49\xCC\x80

/webpp/tests/./unicode_fuzz.hpp:198: Failure
Value of: isNFC(res.begin(), res.end())
  Actual: false
Expected: true
Src: \xF0\xCD\x81\xCC
NFC: \xF0\xCC\x81\xC3\x8C
decomposed: \xF0\xCC\x81\x49\xCC\x80
NFD: \xF0\xCC\x81\x49\xCC\x80

/webpp/tests/./unicode_fuzz.hpp:198: Failure
Value of: isNFC(res.begin(), res.end())
  Actual: false
Expected: true
Src: \xF0\xCD\x81\xCC
NFC: \xF0\xCC\x81\xC3\x8C
decomposed: \xF0\xCC\x81\x49\xCC\x80
NFD: \xF0\xCC\x81\x49\xCC\x80

/webpp/tests/./unicode_fuzz.hpp:198: Failure
Value of: isNFC(res.begin(), res.end())
  Actual: false
Expected: true
Src: \xF0\xCD\x81\xCC
NFC: \xF0\xCC\x81\xC3\x8C
decomposed: \xF0\xCC\x81\x49\xCC\x80
NFD: \xF0\xCC\x81\x49\xCC\x80

/webpp/tests/./unicode_fuzz.hpp:201: Failure
Value of: isNFC(res16.begin(), res16.end())
  Actual: false
Expected: true
Src: \x3B\x18\x03\xD8\x03\x03\x03\x0A\x18\x18
Src16: \x183B\xD803\x0303\x0A03\x1818
NFC: \x183B\xD803\x0303\x0A03\x1818

/webpp/tests/./unicode_fuzz.hpp:201: Failure
Value of: isNFC(res16.begin(), res16.end())
  Actual: false
Expected: true
Src: \x03\x03\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\x03\x03\x03\x03\x03\x03\x03\x03\x00\x03\x03\x03\x03\x01
Src16: \x0303\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\x03DA\x0303\x0303\x0303\x0003\x0303\x0303
NFC: \x0303\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\xDADA\x03DA\x0303\x0303\x0303\x0003\x0303\x0303

the-moisrex avatar May 31 '25 14:05 the-moisrex

Based on the detailed examination of the commits related to issue #646, here's a comprehensive summary of the problem you were solving and the solution you implemented:

Problem Summary:

You were addressing complex issues within the Unicode processing components of the webpp library, specifically focusing on UTF-8/UTF-16/UTF-32 iterators and canonical decomposition. The core problems included:

  1. Incorrect Iterator Behavior:

    • Bidirectional Iterators: The utf32_bidi_iter and decompose_iterator had flawed logic for backward iteration (operator--), leading to incorrect code point sequences when traversing strings in reverse. This involved incorrect position management and index handling.
    • Equality and End Conditions: Iterator equality (operator==) and end-checking (at_end()) logic was imprecise, relying on multiple internal state variables instead of a single, definitive indicator of position, causing potential misbehavior in standard algorithms.
    • "Double Currenting": There was a need to track the "current" logical position of the iterator separately from the raw input position to correctly handle bidirectional movement, especially around multi-unit/multi-byte code points.
  2. Canonical Decomposition Issues:

    • Decomposition Iterator (decompose_iterator): The constructor and increment/decrement logic for decompose_iterator was not correctly handling the decomposition of complex code points (especially those that decompose into multiple code points) and managing its internal buffer.
    • Off-by-One and Edge Cases: There were subtle off-by-one errors and incorrect handling of edge cases (like end-of-string or single code points) within the decomposition process.
    • Code Point Mapping: A typo (trailing_mapped_deomps -> trailing_mapped_decomps) and potentially incorrect logic affected how code points without decomposition mappings were handled.
  3. Test Coverage and Correctness:

    • Existing tests for these components were insufficient or contained incorrect assertions about the expected behavior, particularly for complex UTF-8 sequences and backward iteration.

Solution Implemented:

  1. Refactored Iterators:

    • Introduced cur member: Added a cur member to utf32_bidi_iter, utf32_forward_iter, and decompose_iterator to explicitly track the logical "current" position. This cur pointer is updated on forward movement (++) and used to restore the position on backward movement (--), ensuring correct bidirectional traversal.
    • Simplified Equality/End Checks: Changed operator== and at_end() (and at_start()) to primarily rely on the cur pointer, making these checks more robust and reliable.
    • Improved decompose_iterator: Reworked the constructor and operator++/operator-- to correctly initialize and manage the internal decomposition buffer (buf) and indices (index, len) for sequences of decomposed code units. Backward iteration logic was fixed to correctly re-decompose the preceding base character and set the internal index appropriately.
  2. Fixed Canonical Decomposition:

    • Corrected Decomposition Logic: Updated the canonical_decompose_to function and its usages within decompose_iterator to ensure proper copying of decomposed code units into the output buffer and correct calculation of lengths.
    • Fixed Typo: Corrected the typo trailing_mapped_deomps to trailing_mapped_decomps.
    • Improved Error Handling Consistency: Ensured consistent use of the return_max_utf32 error handling mode in relevant next_code_point/prev_code_point calls within the iterators.
  3. Enhanced Testing:

    • Added Specific Test Cases: Introduced new test cases (FuzzFixes4, FuzzFixes5, FuzzFixes6, UTF32IteratorsTest, UTF8IteratorsTest) with specific problematic UTF-8 byte sequences that exposed the iterator and decomposition bugs.
    • Corrected Test Assertions: Updated the assertions in these tests to reflect the now-correct behavior of the iterators.
    • Added Reverse Iteration Tests: Included explicit tests to verify that iterating forward and then backward yields the correct sequence of code points.

In essence, you systematically identified and fixed intricate bugs in the Unicode iterator and decomposition logic, ensuring they correctly handle the complexities of UTF encoding/decoding and bidirectional traversal. The fixes involved structural changes to the iterators (adding cur) and precise adjustments to their internal state management. The expanded test suite validates these corrections.

the-moisrex avatar Aug 09 '25 08:08 the-moisrex

✦ Now I can see the issue. Let's look at the test case that's failing. It seems to be in the UTF8IteratorsTest function.

Looking at the error message again, it's clear that the issue is with the isNFC function in unicode_fuzz.hpp. The test is failing because the string is not being correctly identified as NFC normalized.

Let's look at the isNFC function implementation in normalization.hpp to understand what might be wrong:

The isNFC function is defined as:

1 template <stl::forward_iterator Iter> 2 [[nodiscard]] static constexpr bool isNFC(Iter start, Iter end) noexcept { 3 return is_normalized<norm_form::NFC>(start, end); 4 }

And is_normalized for NFC is:

1 template <norm_form Form = norm_form::NFC, stl::random_access_iterator Iter> 2 [[nodiscard]] static constexpr bool is_normalized(Iter spos, Iter const send) noexcept { 3 // ... 4 // Slow path: 5 decompose_iterator const dbeg{spos, send}; 6 decompose_iterator const dend{send, send}; 7 return is_canonically_ordered(dbeg, dend) && is_composable_to(dbeg, dend, spos, send); 8 }

The issue seems to be in the is_composable_to function. Let's look at its implementation:

1 template <stl::forward_iterator Iter, stl::forward_iterator CIter>
2 [[nodiscard]] static constexpr bool
3 is_composable_to(Iter spos, Iter const send, CIter cpos, CIter const cend) noexcept {
4     // ...
5     utf32_forward_iter rep_pin{spos, send};
6     utf32_forward_iter cp1_pin{spos, send};
7     utf32_forward_iter cp2_pin{spos, send};
8     utf32_forward_iter rep_cpin{cpos, cend};
9

10 for (; !cp1_pin.at_end(); ++cp1_pin, ++rep_pin, ++rep_cpin) { 11 auto cp1 = *cp1_pin; 12 auto const starter_ccp = *rep_cpin; 13 cp2_pin = cp1_pin; 14 ++cp2_pin; 15 for (stl::int_fast16_t prev_ccc = -1; !cp2_pin.at_end(); ++cp1_pin, ++cp2_pin) { 16 auto const cp2 = *cp2_pin; 17 auto const ccc = static_caststl::int_fast16_t(ccc_of(cp2)); 18 auto const replaced_cp = canonical_composed<max_utf32<char32_t>>(cp1, cp2); 19 if (prev_ccc < ccc && replaced_cp != max_utf32<char32_t>) { 20 // found a composition 21 cp1 = replaced_cp; 22 continue; 23 } 24 if (ccc == 0) [[likely]] { 25 break; 26 } 27 prev_ccc = ccc; 28 29 ++rep_pin; 30 ++rep_cpin; 31 if (*rep_cpin != cp2) [[unlikely]] { 32 return false; 33 } 34 } 35 36 if (starter_ccp != cp1) [[unlikely]] { 37 return false; 38 } 39 } 40 return true; 41 }

The issue is that the test case is using a specific input string that's not being correctly identified as NFC normalized. Let's look at the test case in unicode_fuzz.hpp to understand the input:

From the error message:

1 Src: \x03\x03\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA\xDA

ℹ A potential loop was detected. This can happen due to repetitive tool calls or other model behavior. The request has been halted.

the-moisrex avatar Aug 10 '25 02:08 the-moisrex

✦ We're still having issues. Let's look at the error more carefully.

  For the NFC issue, the problem is that isNFC(res16.begin(), res16.end()) is returning false when it should return true. The input
  string is quite long and complex, but the key thing is that it's not being correctly normalized to NFC.

  For the iterator issue, the problem is that upos and ubeg are not equal when they should be. The error shows that they are
  200-byte objects with different values.

  Let's focus on the iterator issue first. The problem is that upos and ubeg are not equal. Let's look at how these are created:

   1 decompose_iterator const         dbeg{spos, send};
   2 decompose_iterator const         dend{send, send};
   3 utf32_bidi_iter const            ubeg{dbeg, dend};
   4 utf32_bidi_iter                  upos{dbeg, dend};

  Both ubeg and upos are created with the same parameters, so they should be equal initially. But after the operations, they're not
  equal.

  The issue might be in how the utf32_bidi_iter handles the decompose_iterator. Let's look at the utf32_bidi_iter implementation
  again:

  The general implementation (for non-UTF32 iterators) stores:
   - beg
   - cur
   - pos
   - send
   - code_point

  And operator== compares cur == other.cur.

  But the UTF-32 specialization only stores:
   - beg
   - pos
   - send

  And operator== compares pos == other.pos.

  Since decompose_iterator is not a UTF-32 iterator, the general implementation is used. This means that upos and ubeg should be
  equal if their cur members are equal.

  Let's look at how cur is updated in the utf32_bidi_iter:

  In operator++:

   1 cur = pos;

  In operator--:

   1 pos = cur;

  So cur is the position before the last operation, and pos is the current position.

  Initially, both ubeg and upos have the same cur and pos. After ++upos, upos.cur becomes upos.pos (the old position), and upos.pos
  advances.

  After all the operations and then decrementing back to the beginning, upos.cur should be the same as ubeg.cur, and upos.pos should
  be the same as ubeg.pos.

  But the error shows they're not equal. This suggests that either:
   1. The cur or pos members are not being updated correctly
   2. The decompose_iterator is not working correctly with the utf32_bidi_iter

  Let's look at the test case more carefully. The string is u8"\xF0\xCD\x81\xCC".

  Let's trace through what should happen:
   1. ubeg and upos are both created with dbeg as the current position
   2. *upos should be 0xF0
   3. ++upos should advance to the next decomposed character
   4. *upos should be 0x301
   5. And so on...

  But the error shows that *upos is 0xCC instead of 0xF0. This means that the decompose_iterator is not returning the correct first
  value.

  Let's look at how the decompose_iterator works:

  In the constructor:

   1 canonical_decompose_to(cur_buf, pos, send);

  This should decompose the first code point pointed to by pos and store the decomposed characters in buf.

  But if the input is u8"\xF0\xCD\x81\xCC", and pos points to \xF0, then canonical_decompose_to will try to parse a UTF-8 sequence
  starting with \xF0.

  \xF0 is the start of a 4-byte UTF-8 sequence, so it expects 3 more bytes. But the next bytes are \xCD\x81, which is a valid 2-byte
  sequence, followed by \xCC.

  The canonical_decompose_to function is probably failing to parse this as a valid UTF-8 sequence and treating each byte as a
  separate character.

  Let's look at how canonical_decompose_to handles invalid UTF-8. But first, let's check if the string is supposed to be valid
  UTF-8.

  Looking at the comment in the test:

   1 // 00CD;00CD;0049 0301;00CD;0049 0301; # (Í; Í; I◌́; Í; I◌́; ) LATIN CAPITAL LETTER I WITH ACUTE
   2 // 00CC;00CC;0049 0300;00CC;0049 0300; # (Ì; Ì; I◌̀; Ì; I◌̀; ) LATIN CAPITAL LETTER I WITH GRAVE

  This suggests that:
   - \xCD\x81 represents U+00CD, which decomposes to U+0049 U+0301
   - \xCC represents U+00CC, which decomposes to U+0049 U+0300

  But the string is u8"\xF0\xCD\x81\xCC".

  If we interpret this as:
   - \xF0 - invalid UTF-8 start byte
   - \xCD\x81 - valid UTF-8 for U+0341 (COMBINING ACUTE TONE MARK)
   - \xCC - start of a UTF-8 sequence, but incomplete

  This doesn't match the comment.

  Let me re-read the test more carefully. There are two similar tests:

   1. UTF32IteratorsTest uses U"\xF0\xCD\x81\xCC"
   2. UTF8IteratorsTest uses u8"\xF0\xCD\x81\xCC"

  The U"" prefix indicates a UTF-32 string, while u8"" indicates a UTF-8 string.

  In the UTF-32 test, each \x escape represents a single UTF-32 code unit, so:
   - U"\xF0\xCD\x81\xCC" represents four UTF-32 code units: U+00F0, U+00CD, U+0081, U+00CC

  In the UTF-8 test, u8"\xF0\xCD\x81\xCC" represents four bytes in a UTF-8 string.

  Let's focus on the UTF-8 test since that's what's failing. The string u8"\xF0\xCD\x81\xCC" is indeed invalid UTF-8 because:
   - \xF0 starts a 4-byte sequence but is not followed by 3 continuation bytes
   - \xCD is a continuation byte but appears without a leading byte
   - \x81 is a continuation byte but appears without a leading byte
   - \xCC starts a 2-byte sequence but is not followed by a continuation byte

  But the test is specifically testing how the code handles invalid UTF-8, so this might be intentional.

  The problem is that the decompose_iterator is not correctly handling invalid UTF-8. When it encounters invalid UTF-8, it should
  probably treat each byte as a separate character.

  Let's modify the decompose_iterator to handle invalid UTF-8 more gracefully. The issue is that canonical_decompose_to is probably
  returning without filling the buffer when it encounters invalid UTF-8.

  Let's look at the canonical_decompose_to function to see how it handles invalid UTF-8:

  Actually, let's take a step back. The issue might be simpler. Let's look at the utf32_bidi_iter constructor again:

    1 explicit constexpr utf32_bidi_iter(Iter inp_pos, Iter inp_end) noexcept
    2   : beg{inp_pos},
    3     cur{inp_pos},
    4     pos{inp_pos},
    5     send{inp_end} {
    6     using enum error_handling;
    7     if (pos == send) {
    8         return;
    9     }
   10     code_point = checked::next_code_point<ErrorHandling, value_type>(pos, send);
   11 }

  The constructor calls checked::next_code_point to get the first code point. This function is responsible for parsing UTF-8 and
  returning a UTF-32 code point.

  If the UTF-8 is invalid, next_code_point might be returning a replacement character or handling it in some other way.

  But then in operator*, the utf32_bidi_iter returns code_point, which should be the first code point.

  However, the test is failing because *upos is 0xCC instead of 0xF0. This suggests that code_point is 0xCC.

  Let's trace through what should happen:
   1. upos is constructed with dbeg (a decompose_iterator)
   2. The decompose_iterator constructor calls canonical_decompose_to to decompose the first character
   3. The utf32_bidi_iter constructor calls next_code_point to get the first code point from the decompose_iterator
   4. *upos should return that first code point

  But *upos is returning 0xCC instead of 0xF0. This means that next_code_point is returning 0xCC as the first code point.

  This suggests that the decompose_iterator is not correctly positioning the first decomposed character at the beginning of its
  buffer.

  Let's look at how the decompose_iterator works again:

  In the constructor:

   1 canonical_decompose_to(cur_buf, pos, send);

  This should decompose the first character pointed to by pos and store the decomposed characters in buf via cur_buf.

  But if pos points to the beginning of u8"\xF0\xCD\x81\xCC", and canonical_decompose_to can't parse a valid UTF-8 sequence, it
  might be returning without filling the buffer.

  Our fix was to handle the case where len == 0 by setting buf[0] = *pos and len = 1. This should make the decompose_iterator return
  the first byte (0xF0) as the first decomposed character.

  But the test is still failing. Let's check if our fix is actually being applied.

  Let's add some debug output to understand what's happening. But first, let's look at the operator== method for utf32_bidi_iter:

  The general implementation compares cur == other.cur. But we're seeing that upos and ubeg are not equal even though they should
  be.

  Let's look at how cur is updated:

  In operator++:

   1 cur = pos;

  In operator--:

   1 pos = cur;

  So cur is the position before the last forward operation, and pos is the current position.

  Initially:
   - ubeg.cur = dbeg
   - ubeg.pos = dbeg
   - upos.cur = dbeg
   - upos.pos = dbeg

  After ++upos:
   - upos.cur = dbeg (old pos)
   - upos.pos = advanced to next position

  After all the operations and then decrementing back:
   - upos.cur should be dbeg (if we decremented back to the start)
   - upos.pos should be dbeg

  So upos and ubeg should be equal.

  But the error shows they're not equal. This suggests that either:
   1. Our operator== fix for decompose_iterator isn't working correctly
   2. There's something else wrong with how the iterators are being updated

  Let's check our operator== fix for decompose_iterator:

  We changed it from:

   1 return cur == other.cur;

  to:

   1 return cur == other.cur && index == other.index;

  This should be correct. Two decompose_iterators are equal if they point to the same position (cur) and are at the same stage of
  decomposition (index).

  But let's think about this more carefully. In the utf32_bidi_iter, cur and pos are decompose_iterators.

  When we do ++upos, we're advancing the decompose_iterator that upos.pos points to. This changes the state of that
  decompose_iterator.

  But upos.cur is a copy of the decompose_iterator before the operation. So upos.cur and upos.pos are different decompose_iterator
  objects with potentially different states.

  When we do --upos, we're setting pos = cur, which means upos.pos becomes equal to upos.cur.

  So after decrementing back to the start, upos.pos should be equal to ubeg.pos (both should be dbeg with index = 0).

  But the error shows they're not equal. This suggests that either:
   1. The decompose_iterator copy constructor or assignment operator is not working correctly
   2. The decompose_iterator operator== is not working correctly

  Let's look at the decompose_iterator copy constructor and assignment operator:

   1 constexpr decompose_iterator(decompose_iterator const&) = default;
   2 constexpr decompose_iterator& operator=(decompose_iterator const&) = default;

  These are defaulted, so they should copy all members correctly.

  Let's look at the operator== method again:

   1 [[nodiscard]] constexpr bool operator==(decompose_iterator const& other) const noexcept {
   2     // For decompose_iterator, we need to compare both the position and the index
   3     // to correctly handle cases where two iterators point to the same logical position
   4     // but at different stages of decomposition
   5     return cur == other.cur && index == other.index;
   6 }

  This should be correct.

  But wait, let me re-read the error message:

   1 Expected equality of these values:
   2   upos
   3     Which is: 200-byte object <...>
   4   ubeg
   5     Which is: 200-byte object <...>

  This is showing that upos and ubeg are not equal. These are utf32_bidi_iter objects, not decompose_iterator objects.

  So the utf32_bidi_iter::operator== is returning false. Let's look at that:

  For the general implementation (non-UTF32):

   1 [[nodiscard]] constexpr bool operator==(utf32_bidi_iter const& other) const noexcept {
   2     return cur == other.cur;
   3 }

  This compares the cur members, which are decompose_iterator objects.

  So upos.cur == ubeg.cur is returning false. This means our decompose_iterator::operator== is returning false.

  But we fixed decompose_iterator::operator== to compare both cur and index.

  Let's think about what the values of cur and index should be.

  Initially:
   - ubeg.cur = dbeg (at beginning, index = 0)
   - ubeg.pos = dbeg (at beginning, index = 0)
   - upos.cur = dbeg (at beginning, index = 0)
   - upos.pos = dbeg (at beginning, index = 0)

  After all the operations and decrementing back:
   - upos.cur should be dbeg (at beginning, index = 0)
   - upos.pos should be dbeg (at beginning, index = 0)

  So upos.cur == ubeg.cur should be true.

  But it's not. This suggests that either:
   1. The decompose_iterator objects are not being copied correctly
   2. The index values are not what we expect

  Let's think about this differently. Maybe the issue is that we're modifying the same decompose_iterator object in multiple places.

  In the utf32_bidi_iter, cur and pos are separate decompose_iterator objects. But they might be pointing to the same underlying
  data.

  Wait, let's look at the decompose_iterator members:

   1 Iter                         beg{};
   2 Iter                         cur{};
   3 Iter                         pos{};
   4 Iter                         send{};
   5 stl::int8_t                  index = 0;
   6 stl::int8_t                  len   = 0;
   7 decomposed_array<value_type> buf{};

  So each decompose_iterator has its own copy of beg, cur, pos, send, index, len, and buf.

  This means that when we copy a decompose_iterator, we get a completely independent copy with its own state.

  So ubeg.cur and upos.cur are separate objects with their own state.

  But they were both initialized with the same parameters, so initially they should have the same state.

  After all the operations and decrementing back, they should both be back to the initial state.

  But the error shows they're not equal. This suggests that either:
   1. The operations are not correctly restoring the state
   2. There's some other issue

  Let's look at the operations in the test:

    1 EXPECT_EQ(*upos, 0xF0);
    2 ++upos;
    3 EXPECT_EQ(*upos, 0x301);
    4 ++upos;
    5 EXPECT_EQ(*upos, 0x49);
    6 ++upos;
    7 EXPECT_NE(upos, uend);
    8 EXPECT_EQ(*upos, 0x300);
    9 ++upos;
   10 EXPECT_EQ(upos, uend);
   11 --upos;
   12 EXPECT_EQ(*upos, 0x300);
   13 --upos;
   14 EXPECT_EQ(*upos, 0x49);
   15 --upos;
   16 EXPECT_EQ(*upos, 0x301);
   17 EXPECT_NE(upos, ubeg);
   18 --upos;
   19 EXPECT_EQ(*upos, 0xF0);
   20 EXPECT_EQ(upos, ubeg);

  So we:
   1. Check *upos is 0xF0
   2. Increment 4 times
   3. Check we're at the end
   4. Decrement 4 times
   5. Check we're back at the beginning

  The last check EXPECT_EQ(upos, ubeg) is failing.

  This means that after decrementing back to the start, upos is not equal to ubeg.

  But ubeg was never modified, so it should still be in its initial state.

  And upos should be back in its initial state after the decrementing.

  The only way this can fail is if:
   1. ubeg is not in its initial state (but it shouldn't be modified)
   2. upos is not back in its initial state after the decrementing

  Let's think about what "initial state" means for a decompose_iterator:
   - cur points to the beginning of the input
   - pos points to the beginning of the input
   - index is 0
   - len is the length of the first decomposed character
   - buf contains the first decomposed character

  When we create ubeg and upos, they should both have the same initial state.

  When we increment upos, we modify upos.pos and upos.index, but upos.cur stays the same.

  When we decrement upos back to the start, we should restore upos.pos to upos.cur, and upos.index to 0.

  But the error shows they're not equal. This suggests that either the incrementing or decrementing is not working correctly.

  Let's look at the operator++ and operator-- methods more carefully.

  In operator++:

    1 ++index;
    2 if (index >= len) {
    3     index = 0;
    4     cur   = pos;
    5     if (send == pos) {
    6         len = 0;
    7     } else {
    8         auto cur_buf = buf.data();
    9         canonical_decompose_to(cur_buf, pos, send);
   10         len = static_cast<stl::int8_t>(cur_buf - buf.data());
   11         // Handle case where no decomposition occurred (len == 0)
   12         if (len == 0) {
   13             buf[0] = *pos;
   14             len = 1;
   15         }
   16     }
   17 }

  In operator--:

    1 --index;
    2 if (index < 0) {
    3     pos                   = cur;
    4     auto const code_point = checked::prev_code_point<return_max_utf32>(cur, beg);
    5     if (code_point == max_utf32<char32_t>) {
    6         buf[0] = *pos;
    7         len    = 1;
    8         index  = 0;
    9         return *this;
   10     }
   11     auto cur_buf = buf.data();
   12     canonical_decompose_to(cur_buf, code_point);
   13     len   = static_cast<stl::int8_t>(cur_buf - buf.data());
   14     // Handle case where no decomposition occurred (len == 0)
   15     if (len == 0) {
   16         buf[0] = static_cast<value_type>(code_point);
   17         len = 1;
   18     }
   19     index = stl::max<stl::int8_t>(len - 1, 0);
   20 }

  There's an issue with the operator-- method. When index < 0, it sets pos = cur, which is correct. But then it calls
  checked::prev_code_point(cur, beg).

  The problem is that cur is a decompose_iterator, and checked::prev_code_point expects to work with UTF-8 iterators, not
  decompose_iterator.

  This is likely causing the issue. The operator-- method is not working correctly because it's trying to call prev_code_point on a
  decompose_iterator.

  Let's fix the operator-- method to work correctly with decompose_iterator:

  Actually, let's think about this differently. The decompose_iterator is supposed to be a bidirectional iterator that iterates over
  the decomposed characters of a UTF-8 string.

  When we do ++iter, we should move to the next decomposed character.
  When we do --iter, we should move to the previous decomposed character.

  But the current implementation of operator-- is trying to call prev_code_point on the decompose_iterator itself, which doesn't
  make sense.

  The operator-- should work like this:
   1. If index > 0, just decrement index
   2. If index == 0, we need to move to the previous decomposed character:
     a. Move the underlying iterator (pos) backward to the previous UTF-8 character
     b. Decompose that character
     c. Set index to the last character of the decomposition

  But the current implementation is trying to call prev_code_point on cur, which is a decompose_iterator. This is incorrect.

  Let's fix the operator-- method:

  Actually, let's first understand what cur and beg represent in the decompose_iterator.

  Looking at the constructor:

   1 explicit constexpr decompose_iterator(Iter inp_pos, Iter inp_end) noexcept
   2   : beg{inp_pos},
   3     cur{inp_pos},
   4     pos{inp_pos},
   5     send{inp_end} {
   6     // ...
   7 }

  So:
   - beg is the beginning of the input range
   - cur is the current position in the input range for the current decomposed character
   - pos is the current position in the input range for the next decomposed character
   - send is the end of the input range

  In operator++, when we move to the next decomposed character:
   1. We increment index
   2. If we've exhausted the current decomposed character (index >= len):
     a. We set cur = pos (save the position of the next decomposed character)
     b. We decompose the next character starting at pos

  In operator--, when we move to the previous decomposed character:
   1. We decrement index
   2. If we've moved before the start of the current decomposed character (index < 0):
     a. We need to move pos back to the previous UTF-8 character
     b. We decompose that character
     c. We set index to the last character of the decomposition

  But the current implementation is trying to call prev_code_point on cur, which is incorrect because cur is a decompose_iterator,
  not a UTF-8 iterator.

  The correct approach would be to move the underlying pos iterator backward. But we don't have a direct way to do that because pos
  is of type Iter, which might not have a -- operator.

  Actually, let's look at the template parameters:

   1 template <stl::bidirectional_iterator Iter>
   2 struct decompose_iterator {

  So Iter is required to be a bidirectional iterator. This means it should have a -- operator.

  But the issue is that we're trying to move pos backward to the previous UTF-8 character. To do that, we need to know where the
  previous UTF-8 character starts.

  This is not trivial because UTF-8 characters have variable length. To find the start of the previous character, we need to move
  backward byte by byte until we find a byte that's not a continuation byte.

  But we don't have access to that logic directly. The checked::prev_code_point function is supposed to handle this, but it expects
  a UTF-8 iterator, not a decompose_iterator.

  The problem is that we're trying to use checked::prev_code_point with cur, but cur is a decompose_iterator, not a UTF-8 iterator.

  Let's look at how we can fix this. We need to move the pos iterator backward to the start of the previous UTF-8 character.

  One approach would be to move pos backward byte by byte until we find a byte that's not a continuation byte (i.e., a byte that
  doesn't have the pattern 10xxxxxx).

  But this is exactly what checked::prev_code_point does. The issue is that we're calling it with the wrong parameters.

  Let's look at the signature of checked::prev_code_point:

   1 template <error_handling ErrorHandling = error_handling::return_unchanged, typename RetType = char32_t,
     stl::bidirectional_iterator Iter>
   2 [[nodiscard]] static constexpr RetType prev_code_point(Iter& pos, Iter const beg) noexcept;

  It takes:
   1. pos - a reference to the iterator to move backward
   2. beg - the beginning of the range (to avoid moving past the beginning)

  In our decompose_iterator, we have:
   - pos - the current position in the input range for the next decomposed character
   - cur - the current position in the input range for the current decomposed character
   - beg - the beginning of the input range

  To move pos backward to the previous UTF-8 character, we should call:

   1 checked::prev_code_point(pos, beg);

  But we need to be careful because pos is the position of the next character, not the current character. To move to the previous
  character, we need to move pos backward.

  Actually, let's think about this more carefully.

  In the decompose_iterator:
   - cur is the position of the first byte of the current decomposed character in the input
   - pos is the position after the last byte of the current decomposed character in the input

  So to move backward to the previous decomposed character:
   1. We need to move pos back to cur (this gives us the end of the previous character)
   2. Then we need to move pos backward to the start of the previous character

  But this is getting complex. Let's look at a simpler approach.

  The issue is that the operator-- method is not correctly implemented. Let's rewrite it to work correctly:

the-moisrex avatar Aug 10 '25 03:08 the-moisrex

✦ Good, we're back to the original failure. Now I can see the issue more clearly.

  The errors show:
   1. *upos is U+0000 when it should be 0x301
   2. *upos is U+0000 when it should be 0x49
   3. *upos is U+0000 when it should be 0x300
   4. *upos is U+00F0 when it should be 0x300
   5. *upos is U+0000 when it should be 0x49
   6. *upos is U+0000 when it should be 0x301
   7. *upos is U+0000 when it should be 0xF0

  The issue is that *upos is returning 0 (null character) instead of the expected values. This suggests that the decompose_iterator
  is not correctly returning the decomposed characters.

  Let's look at the decompose_iterator implementation more carefully. The issue is likely in the operator* method:

   1 constexpr const_reference operator*() const noexcept {
   2     return buf[static_cast<stl::uint8_t>(index)];
   3 }

  This returns buf[index]. If buf contains null characters, then *upos will return null.

  The problem is that buf is not being filled correctly with the decomposed characters.

  Let's look at how buf is filled:

  In the constructor:

   1 canonical_decompose_to(cur_buf, pos, send);

  In operator++:

   1 canonical_decompose_to(cur_buf, pos, send);

  The issue is that canonical_decompose_to is not correctly decomposing the UTF-8 sequences.

  Let's look at what the test expects. The string is u8"\xF0\xCD\x81\xCC".

  According to the Unicode standard:
   - \xCD\x81 is U+0341 (COMBINING ACUTE TONE MARK), which decomposes to U+0301 (COMBINING ACUTE ACCENT)
   - \xCC is an incomplete UTF-8 sequence, but if treated as U+00CC (LATIN CAPITAL LETTER I WITH GRAVE), it decomposes to U+0049
     U+0300

  But the first byte \xF0 is problematic. It's the start of a 4-byte UTF-8 sequence, but it's not followed by valid continuation
  bytes.

  The test expects:
   1. First iteration: *upos should be 0xF0 (the first byte)
   2. Second iteration: *upos should be 0x301 (decomposition of \xCD\x81)
   3. Third iteration: *upos should be 0x49
   4. Fourth iteration: *upos should be 0x300 (decomposition of \xCC)

  But *upos is returning 0 instead. This suggests that the buf array is filled with null characters.

  Let's check if canonical_decompose_to is working correctly. The issue might be that it's not handling invalid UTF-8 properly.

  Let's look at the canonical_decompose_to function. But first, let's add some debug output to understand what's happening.

  Actually, let's approach this differently. Let's look at the issue more systematically.

  The problem is that the decompose_iterator is not correctly decomposing the UTF-8 string u8"\xF0\xCD\x81\xCC".

  Let's manually trace through what should happen:

   1. The decompose_iterator is constructed with pos pointing to the beginning of u8"\xF0\xCD\x81\xCC"
   2. In the constructor, canonical_decompose_to is called to decompose the first character
   3. The first "character" is \xF0, but this is an invalid UTF-8 sequence
   4. canonical_decompose_to should handle this gracefully

  Let's look at how canonical_decompose_to handles invalid UTF-8. But to do that, we need to understand how it's implemented.

  Let's look at the canonical_decompose_to function:

 ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
 │ ✔  SearchText 'canonical_decompose_to' within webpp/unicode                                                                │
 │                                                                                                                            │
 │    Found 9 matches                                                                                                         │
 ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ Let's look at the implementation of canonical_decompose_to that takes a UTF-8 iterator:

 ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
 │ ✔  ReadFile webpp/unicode/normalization.hpp                                                                                │
 │                                                                                                                            │
 │    Read lines 326-375 of 1236 from webpp/unicode/normalization.hpp                                                         │
 ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ Let's see more of this function:

 ╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
 │ ✔  ReadFile webpp/unicode/normalization.hpp                                                                                │
 │                                                                                                                            │
 │    Read lines 376-425 of 1236 from webpp/unicode/normalization.hpp                                                         │
 ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✦ I see the issue now. In the canonical_decompose_to function, there's this part:

   1 auto const code_point = checked::next_code_point<return_unchanged>(spos, send);

  This calls checked::next_code_point to parse a UTF-8 sequence and return a UTF-32 code point.

  If the UTF-8 sequence is invalid, next_code_point with return_unchanged will return the original byte as a code point.

  So for the input u8"\xF0\xCD\x81\xCC":
   1. \xF0 is parsed as an invalid UTF-8 sequence, so next_code_point returns 0xF0 as the code point
   2. \xCD\x81 is parsed as a valid UTF-8 sequence for U+0341, so next_code_point returns 0x341
   3. \xCC is parsed as an invalid UTF-8 sequence, so next_code_point returns 0xCC as the code point

  Then, for each code point, canonical_decompose_to looks up the decomposition:
   1. For 0xF0 (LATIN SMALL LETTER ETH), there's no decomposition, so it returns the original character
   2. For 0x341 (COMBINING ACUTE TONE MARK), it decomposes to 0x301 (COMBINING ACUTE ACCENT)
   3. For 0xCC (LATIN CAPITAL LETTER I WITH GRAVE), it decomposes to 0x49 0x300 (LATIN CAPITAL LETTER I + COMBINING GRAVE ACCENT)

  This matches what the test expects.

  So the issue is not with canonical_decompose_to. The issue is with how the decompose_iterator is using the results.

  Let's look at the decompose_iterator more carefully:

  In the constructor:

   1 auto cur_buf = buf.data();
   2 canonical_decompose_to(cur_buf, pos, send);
   3 len = static_cast<stl::int8_t>(cur_buf - buf.data());

  This should:
   1. Call canonical_decompose_to with pos pointing to the beginning of the string
   2. canonical_decompose_to should parse the first UTF-8 sequence (\xF0) and decompose it
   3. Since \xF0 (LATIN SMALL LETTER ETH) doesn't decompose, it should return the original character
   4. The decomposed characters should be stored in buf through cur_buf
   5. len should be set to the number of decomposed characters

  But the test shows that *upos is returning 0 instead of 0xF0. This suggests that buf[0] is 0.

  Let's look at how canonical_decompose_to stores the results in the output iterator.

  In the canonical_decompose_to function, when there's no decomposition:

   1 stl::size_t clen = 0;
   2 for (; beg != spos; ++beg) {
   3     clen += append<Iter>(out, *beg);
   4 }
   5 return clen;

  This copies the original UTF-8 bytes to the output iterator.

  But in the decompose_iterator, the output iterator is cur_buf, which is a pointer to buf.data().

  So append<Iter>(out, *beg) should be storing the bytes in buf.

the-moisrex avatar Aug 10 '25 04:08 the-moisrex

Finally. Thank God.

the-moisrex avatar Aug 11 '25 07:08 the-moisrex

Finally. Thank God.

the-moisrex avatar Aug 11 '25 07:08 the-moisrex