Avalonia.HtmlRenderer renders Unicode characters at the wrong position when they follow ASCII characters (and vice versa)

Example code:

HtmlPanel.Text =
    """
    <html>
    <body>
    <h3>Slash"/"</h3>
    D:/folder<br/>
    D:/文件夹<br/>
    D:/でたらめ<br/>
    D:/무작위의<br/>
    D:/❽(black 8 ball)<br/>
    <h3>Backslash"\"</h3>
    D:\folder<br/>
    D:\文件夹<br/>
    D:\でたらめ<br/>
    D:\무작위의<br/>
    D:\❽(black 8 ball)<br/>
    <h3>Other Symbols</h3>
    <h4>":"</h4>
    :folder<br/>
    :文件夹<br/>
    :でたらめ<br/>
    :무작위의<br/>
    :❽(black 8 ball)<br/>
    <h4>"|"</h4>
    |folder<br/>
    |文件夹<br/>
    |でたらめ<br/>
    |무작위의<br/>
    |❽(black 8 ball)<br/>
    <h4>'"'</h4>
    "folder<br/>
    "文件夹<br/>
    "でたらめ<br/>
    "무작위의<br/>
    "❽(black 8 ball)<br/>
    <h4>"X"</h4>
    Xfolder<br/>
    X文件夹<br/>
    Xでたらめ<br/>
    X무작위의<br/>
    X❽(black 8 ball)<br/>
    </body>
    </html>
    """;

Render result:

Check the "black 8 ball" lines: the character "❽" and the word "black" are rendered in the wrong place, and "black" is not positioned after "❽". The Japanese text also behaves differently from the others. In the other cases only the single character at the boundary between ASCII and non-ASCII text is affected, but all of the Japanese characters are shifted left to the wrong position.

Jul 15 '25 04:07 Flithor

I checked the code; this may be caused by wrong logic in CssBox.ParseToWords: https://github.com/AvaloniaUI/Avalonia.HtmlRenderer/blob/dbc94f463122b95b92cbd22552e82e714d42e4c5/external/HtmlRenderer/Core/Dom/CssBox.cs#L556-L608 It looks like it groups characters into a single block that should not be grouped together.

Jul 15 '25 04:07 Flithor

@maxkatz6 I wrote new grouping logic to replace the buggy code. It looks fine to me (tested with placeholder text), but I haven't tested it in all scenarios.

endIdx = startIdx;
// Check if the current character is an ASCII character
var isAscii = text[endIdx] < 128;
// If the current character is not a whitespace
if (!char.IsWhiteSpace(text[endIdx]))
{
    // Move to the next character
    endIdx++;
    // If the current character is ASCII
    if (isAscii)
    {
        // Continue moving to the next character as long as it is an ASCII character,
        // not a whitespace, and not a symbol
        while (endIdx < text.Length &&
               text[endIdx] < 128 &&
               !char.IsWhiteSpace(text[endIdx]) &&
               !char.IsSymbol(text[endIdx]))
            endIdx++;
        // If the next character is not a control character and is a hyphen, move to the next character
        if (endIdx < text.Length &&
            char.GetUnicodeCategory(text[endIdx]) != UnicodeCategory.Control &&
            text[endIdx] == '-')
            endIdx++;
    }
    // If the current character is not ASCII and next char is punctuation
    else if (endIdx < text.Length && char.GetUnicodeCategory(text[endIdx]) == UnicodeCategory.OtherPunctuation)
    {
        // Move to the next character
        endIdx++;
    }
}

Buggy code part: https://github.com/AvaloniaUI/Avalonia.HtmlRenderer/blob/dbc94f463122b95b92cbd22552e82e714d42e4c5/external/HtmlRenderer/Core/Dom/CssBox.cs#L571-L595
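
To make the grouping logic above easier to try outside the renderer, here is a minimal sketch that wraps it in a standalone method. The class WordGrouping and the method SplitWords are hypothetical names for illustration and are not part of Avalonia.HtmlRenderer; the sketch also simplifies whitespace handling by dropping whitespace instead of emitting it as a separate word.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical harness, not part of Avalonia.HtmlRenderer.
public static class WordGrouping
{
    // Splits text into the chunks that the grouping logic above would produce.
    public static List<string> SplitWords(string text)
    {
        var words = new List<string>();
        var startIdx = 0;
        while (startIdx < text.Length)
        {
            // Simplification (assumption): whitespace is skipped, not emitted.
            if (char.IsWhiteSpace(text[startIdx]))
            {
                startIdx++;
                continue;
            }

            var endIdx = startIdx;
            var isAscii = text[endIdx] < 128;
            endIdx++;
            if (isAscii)
            {
                // Extend over ASCII characters that are neither whitespace nor symbols.
                while (endIdx < text.Length &&
                       text[endIdx] < 128 &&
                       !char.IsWhiteSpace(text[endIdx]) &&
                       !char.IsSymbol(text[endIdx]))
                    endIdx++;
                // Note: '-' is ASCII dash punctuation, so the loop above already
                // consumes it; as written, this check never succeeds.
                if (endIdx < text.Length && text[endIdx] == '-')
                    endIdx++;
            }
            else if (endIdx < text.Length &&
                     char.GetUnicodeCategory(text[endIdx]) == UnicodeCategory.OtherPunctuation)
            {
                // Keep trailing punctuation attached to a non-ASCII character.
                endIdx++;
            }

            words.Add(text.Substring(startIdx, endIdx - startIdx));
            startIdx = endIdx;
        }
        return words;
    }
}
```

Under these assumptions, SplitWords("D:/文件夹") yields "D:/", "文", "件", "夹": the ASCII prefix stays together and each CJK character becomes its own word, so the ASCII/non-ASCII boundary always falls between two words.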

Jul 15 '25 08:07 Flithor

This logic should use the LineBreakEnumerator

Jul 15 '25 11:07 Gillibald

@Gillibald It cannot correctly mark the boundaries between ASCII characters and non-ASCII Unicode characters, which causes rendering errors. Example:

Lorem Ipsum：无处不在的占位符文本，笼罩在神秘之中，但在设计中至关重要。
      ^    ^^

"：" is U+FF1A FULLWIDTH COLON.

Code:

var lineBreaks = new LineBreakEnumerator(text);
while (lineBreaks.MoveNext(out var lineBreak))
{
    _boxWords.Add(new CssRectWord(this, HtmlUtils.DecodeHtml(text.Slice(startIdx, lineBreak.PositionMeasure - startIdx).ToString()).AsMemory(),
        false, lineBreak.PositionWrap > lineBreak.PositionMeasure));
    startIdx = lineBreak.PositionWrap;
}

To solve the above problem, each word of each segment would still need to be analyzed, so there seems to be no benefit in using LineBreakEnumerator.
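
As a rough illustration of that extra per-segment analysis, the sketch below splits one line-break segment into runs at every ASCII/non-ASCII boundary. SegmentSplitter and SplitAtAsciiBoundaries are hypothetical names, and this is only one possible way to do the post-processing, not code from the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical post-processing step, not part of Avalonia.HtmlRenderer.
public static class SegmentSplitter
{
    // Splits a segment into maximal runs of ASCII or non-ASCII UTF-16 code units.
    public static List<string> SplitAtAsciiBoundaries(string segment)
    {
        var runs = new List<string>();
        var runStart = 0;
        for (var i = 1; i <= segment.Length; i++)
        {
            // Close the current run at the end of the segment or when the
            // ASCII-ness of the current character differs from the run's first one.
            if (i == segment.Length ||
                (segment[i] < 128) != (segment[runStart] < 128))
            {
                runs.Add(segment.Substring(runStart, i - runStart));
                runStart = i;
            }
        }
        return runs;
    }
}
```

For the example above, a segment such as "Ipsum：无处" would be split into "Ipsum" and "：无处". Both halves of a surrogate pair are non-ASCII code units, so this step never separates them.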

Jul 16 '25 03:07 Flithor

You can also use the GraphemeEnumerator to avoid splitting sequences that belong together.
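
To show what grapheme-based splitting buys, here is a small sketch that uses the .NET base class library's StringInfo.GetTextElementEnumerator as a stand-in. It is not Avalonia's GraphemeEnumerator, whose exact API is not shown in this thread; GraphemeSplitter and SplitGraphemes are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Stand-in sketch using the BCL, not Avalonia's GraphemeEnumerator.
public static class GraphemeSplitter
{
    // Returns the text elements of the input, keeping a base character
    // together with its combining marks and keeping surrogate pairs whole.
    public static List<string> SplitGraphemes(string text)
    {
        var clusters = new List<string>();
        var enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
            clusters.Add(enumerator.GetTextElement());
        return clusters;
    }
}
```

A character-by-character loop would separate "e" from a following combining accent (U+0301), while enumerating text elements keeps them in one unit that can then be measured and positioned together.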

Jul 16 '25 03:07 Gillibald

@Gillibald I tried it, but I don't have enough time to write Unicode-compliant code. If LineBreakEnumerator and GraphemeEnumerator are to be used to improve this library, I'd rather leave that to someone with more expertise. I'll use my "hacky" solution; it's enough for me.

For example, it cannot correctly apply the concept of "words" to other Unicode characters. My scenario does not need to consider these cases, but this library has to serve a wider user group, and my code is not good enough for that.

Jul 16 '25 08:07 Flithor

I am working on a fix for this issue. I will try to open a pull request this week.

Jul 22 '25 23:07 ia-alpatov

@Gillibald I have created a refactoring (#77) that uses GraphemeEnumerator.

Nov 03 '25 13:11 ajtn123