Unicode characters are rendered at the wrong position when they follow ASCII characters (and vice versa)
Example code:
HtmlPanel.Text =
"""
<html>
<body>
<h3>Slash"/"</h3>
D:/folder<br/>
D:/文件夹<br/>
D:/でたらめ<br/>
D:/무작위의<br/>
D:/❽(black 8 ball)<br/>
<h3>Backslash"\"</h3>
D:\folder<br/>
D:\文件夹<br/>
D:\でたらめ<br/>
D:\무작위의<br/>
D:\❽(black 8 ball)<br/>
<h3>Other Symbols</h3>
<h4>":"</h4>
:folder<br/>
:文件夹<br/>
:でたらめ<br/>
:무작위의<br/>
:❽(black 8 ball)<br/>
<h4>"|"</h4>
|folder<br/>
|文件夹<br/>
|でたらめ<br/>
|무작위의<br/>
|❽(black 8 ball)<br/>
<h4>'"'</h4>
"folder<br/>
"文件夹<br/>
"でたらめ<br/>
"무작위의<br/>
"❽(black 8 ball)<br/>
<h4>"X"</h4>
Xfolder<br/>
X文件夹<br/>
Xでたらめ<br/>
X무작위의<br/>
X❽(black 8 ball)<br/>
</body>
</html>
""";
Render result:
Check the black 8 ball lines: the character "❽" and the word "black" are rendered in the wrong place, and "black" does not follow "❽".
Japanese characters also behave differently from the others. For the other scripts, only the one character at the boundary between ASCII and Unicode characters is affected, but all Japanese characters are shifted left to the wrong position.
Looking at the code, this may be caused by incorrect logic in CssBox.ParseToWords.
https://github.com/AvaloniaUI/Avalonia.HtmlRenderer/blob/dbc94f463122b95b92cbd22552e82e714d42e4c5/external/HtmlRenderer/Core/Dom/CssBox.cs#L556-L608
It looks like it groups characters into one block that should not be grouped together.
@maxkatz6 I wrote new grouping logic to replace the buggy code. It looks good to me (tested with placeholder text), but I haven't tested it in all scenarios.
endIdx = startIdx;
// Check if the current character is an ASCII character
var isAscii = text[endIdx] < 128;
// If the current character is not a whitespace
if (!char.IsWhiteSpace(text[endIdx]))
{
    // Move to the next character
    endIdx++;
    // If the current character is ASCII
    if (isAscii)
    {
        // Continue moving to the next character as long as it is an ASCII character,
        // not a whitespace, and not a symbol
        while (endIdx < text.Length &&
               text[endIdx] < 128 &&
               !char.IsWhiteSpace(text[endIdx]) &&
               !char.IsSymbol(text[endIdx]))
            endIdx++;
        // If the next character is not a control character and is a hyphen, move to the next character
        if (endIdx < text.Length &&
            char.GetUnicodeCategory(text[endIdx]) != UnicodeCategory.Control &&
            text[endIdx] == '-')
            endIdx++;
    }
    // If the current character is not ASCII and next char is punctuation
    else if (endIdx < text.Length && char.GetUnicodeCategory(text[endIdx]) == UnicodeCategory.OtherPunctuation)
    {
        // Move to the next character
        endIdx++;
    }
}
Buggy code part: https://github.com/AvaloniaUI/Avalonia.HtmlRenderer/blob/dbc94f463122b95b92cbd22552e82e714d42e4c5/external/HtmlRenderer/Core/Dom/CssBox.cs#L571-L595
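To make the effect of the proposed grouping easier to check, here is a minimal self-contained sketch that wraps the same conditions in a loop and returns the resulting chunks. The class name `WordGrouping`, the method `Split`, and the simplified whitespace skipping are assumptions for illustration only; they are not part of CssBox.ParseToWords. One observation from writing it out: the separate hyphen check appears to be unreachable, because `'-'` is ASCII, not whitespace, and not a symbol, so the preceding loop already consumes it.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of HtmlRenderer.
public static class WordGrouping
{
    public static List<string> Split(string text)
    {
        var words = new List<string>();
        int startIdx = 0;
        while (startIdx < text.Length)
        {
            // Simplification (assumption): whitespace is skipped, not tracked.
            if (char.IsWhiteSpace(text[startIdx]))
            {
                startIdx++;
                continue;
            }

            int endIdx = startIdx;
            bool isAscii = text[endIdx] < 128;
            endIdx++;

            if (isAscii)
            {
                // Extend over ASCII characters that are neither whitespace nor symbols.
                while (endIdx < text.Length &&
                       text[endIdx] < 128 &&
                       !char.IsWhiteSpace(text[endIdx]) &&
                       !char.IsSymbol(text[endIdx]))
                    endIdx++;
                // Kept from the proposal; never true here, since the loop
                // above has already consumed any '-'.
                if (endIdx < text.Length && text[endIdx] == '-')
                    endIdx++;
            }
            else if (endIdx < text.Length &&
                     char.GetUnicodeCategory(text[endIdx]) == UnicodeCategory.OtherPunctuation)
            {
                // Keep trailing punctuation attached to a non-ASCII character.
                endIdx++;
            }

            words.Add(text.Substring(startIdx, endIdx - startIdx));
            startIdx = endIdx;
        }
        return words;
    }
}
```

With this sketch, `WordGrouping.Split("D:/文件夹")` yields the chunks `D:/`, `文`, `件`, `夹`, so the ASCII prefix and each CJK character are measured separately instead of being merged into one block.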
This logic should use the LineBreakEnumerator.
@Gillibald It cannot correctly mark the boundaries between ASCII and Unicode characters, which causes rendering errors. Example:
Lorem Ipsum:无处不在的占位符文本,笼罩在神秘之中,但在设计中至关重要。
^ ^^
:: U+FF1A : FULLWIDTH COLON
Code:
var lineBreaks = new LineBreakEnumerator(text);
while (lineBreaks.MoveNext(out var lineBreak))
{
    _boxWords.Add(new CssRectWord(this,
        HtmlUtils.DecodeHtml(text.Slice(startIdx, lineBreak.PositionMeasure - startIdx).ToString()).AsMemory(),
        false, lineBreak.PositionWrap > lineBreak.PositionMeasure));
    startIdx = lineBreak.PositionWrap;
}
To solve the above problem, each word of each segment would still need to be analyzed, so there seems to be no benefit in using LineBreakEnumerator.
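The extra per-segment analysis mentioned above could look roughly like the following sketch, which splits one line-break segment into runs wherever the text switches between ASCII and non-ASCII. `SegmentRuns` and `SplitAtAsciiBoundary` are hypothetical names, and treating "code unit below 128" as the only boundary criterion is an assumption that ignores scripts, surrogate pairs, and combining marks.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of HtmlRenderer or Avalonia.
public static class SegmentRuns
{
    // Splits a segment into maximal runs of ASCII or non-ASCII characters.
    public static List<string> SplitAtAsciiBoundary(string segment)
    {
        var runs = new List<string>();
        int start = 0;
        for (int i = 1; i <= segment.Length; i++)
        {
            // Close the current run at the end of the segment or when the
            // ASCII-ness of the character differs from the start of the run.
            if (i == segment.Length ||
                (segment[i] < 128) != (segment[start] < 128))
            {
                runs.Add(segment.Substring(start, i - start));
                start = i;
            }
        }
        return runs;
    }
}
```

For the segment `Ipsum:无处` from the example, this produces the two runs `Ipsum` and `:无处`, which is the boundary the line-break pass alone does not report.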
You can also use the GraphemeEnumerator to avoid splitting sequences that belong together.
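As an illustration of why grapheme-level iteration matters, the sketch below uses .NET's built-in `StringInfo.GetTextElementEnumerator` as a stand-in; it is not Avalonia's GraphemeEnumerator, whose API is not shown in this thread. The class name `GraphemeSplit` is hypothetical. The point is that a base letter plus combining mark, or a surrogate pair, comes back as one element, so a word splitter built on top of it would not cut through such a sequence.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper; uses the BCL text element enumerator as a stand-in
// for a grapheme enumerator.
public static class GraphemeSplit
{
    public static List<string> TextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            // Each element is one user-perceived character, which may span
            // several UTF-16 code units.
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }
}
```

For example, `GraphemeSplit.TextElements("e\u0301x")` returns two elements, the accented "e" and "x", even though the string contains three `char` values.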
@Gillibald I tried it, but I don't have enough time to write code that is Unicode-compliant.
So if LineBreakEnumerator and GraphemeEnumerator are to be used to improve this library, I'd like to leave that to someone with more expertise.
I'll use my "hacky" solution; it's enough for me.
For example, it cannot correctly handle other Unicode characters under the concept of "words". My scenario does not need to consider these, but this library needs to serve a wider user group, and my code is not good enough for that.
I am working on a fix for this issue. I will try to make a pull request this week.
@Gillibald I have created a refactoring (#77) that uses GraphemeEnumerator.