NBSP char in XML name allowed

Open wangyoutian opened this issue 1 year ago • 5 comments

1. Description

If an NBSP character (0xA0) is appended to an element name, the markup is parsed normally and no exception is thrown.

For example:

<constituent></constituent  >note in the end tag before this sentence, char 0xA0, not 0x20, is appended<constituent></constituent>

is parsed as one "constituent" element, not two. The problem is silently suppressed, which is not good: it is hard to debug because 0xA0 is visually indistinguishable from 0x20.
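Because the two characters look identical on screen, one way to make the problem visible is to scan the raw markup for U+00A0 inside tags before handing it to a parser. The following is a minimal sketch in Python, independent of HtmlAgilityPack; the function name `find_nbsp_in_tags` and the simple tag pattern are assumptions made for illustration, not part of any library.

```python
import re

NBSP = "\u00a0"

# Rough pattern for anything that looks like a tag: '<', optional '/',
# then everything up to the next '>'. Not a full HTML/XML tokenizer.
TAG_RE = re.compile(r"</?[^<>]*>")


def find_nbsp_in_tags(markup):
    """Return a list of (offset, tag_text) for each U+00A0 found inside a tag."""
    hits = []
    for match in TAG_RE.finditer(markup):
        tag = match.group(0)
        for index, char in enumerate(tag):
            if char == NBSP:
                hits.append((match.start() + index, tag))
    return hits


# The end tag of the first element ends with U+00A0 instead of U+0020.
sample = "<constituent></constituent\u00a0>text<constituent></constituent>"
for offset, tag in find_nbsp_in_tags(sample):
    print(f"U+00A0 at offset {offset} in tag {tag!r}")
```

NBSP characters in ordinary text content are deliberately not reported, since they are legitimate there; only occurrences inside tag markup are flagged.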

2. Expectation

https://dev.w3.org/html5/spec-LC/syntax.html#:~:text=HTML%20elements%20all%20have%20names,005A%20LATIN%20CAPITAL%20LETTER%20Z.

does not allow such characters in an element name.

Nor does XML, as stipulated in:

http://w3.org/TR/REC-xml/#NT-NameStartChar

If the parser accepts such characters silently, it's hard to pin down the issue.

Solution?

Should the documentation explicitly state that such characters are allowed, or should an exception be thrown?

wangyoutian avatar Dec 18 '24 11:12 wangyoutian

Hello @wangyoutian ,

Is it possible for you to reproduce this issue in .NET Fiddle?

I currently get 2 "constituent" elements on my side: https://dotnetfiddle.net/Yb9nBG

The end tag with the 0xA0 is simply ignored.

Best Regards,

Jon

JonathanMagnan avatar Dec 18 '24 14:12 JonathanMagnan

https://dotnetfiddle.net/5WSaR2

reproduces the issue (see "test 3" there), where: var tex = "<c></c\u00a0><c></c>";

Another notable phenomenon: if we replace the letter 'c' with 'a', two elements are parsed, as expected; see "test 4" there.

(the above code can also be found at:

https://github.com/nilnul/nilnul.html.TEST/blob/nilnul-pub/el/content/parse/nbsp/UnitTest1.cs

)

wangyoutian avatar Dec 19 '24 08:12 wangyoutian

Thank you ;)

JonathanMagnan avatar Dec 19 '24 13:12 JonathanMagnan

Hello @wangyoutian ,

What kind of behavior are you expecting? We currently have the same behavior as browsers like Firefox and Chrome.

Since this is an "EndTag" that doesn't have any corresponding "BeginTag", we simply ignore it and continue the logic. A div tag can be inside a div, but an a tag cannot be inside an a tag, so the two cases produce a different number of elements.

But I'm not expecting any kind of error to be thrown.

Let me know more, as at this moment I believe it works as intended.

JonathanMagnan avatar Dec 24 '24 13:12 JonathanMagnan

In some text input fields on some web pages, such as a "textarea", a typed space (0x20) is converted to NBSP (0xA0).

So if a user enters XML code in such a field and inadvertently types a space (0x20) after the end tag name, that space is converted into NBSP (0xA0).

The user would think it is still a space, since NBSP is visually indistinguishable. If it really were a space (0x20), then per the specification:

https://dev.w3.org/html5/spec-LC/syntax.html#:~:text=HTML%20elements%20all%20have%20names,005A%20LATIN%20CAPITAL%20LETTER%20Z

8.1.2.2 End tags

End tags must have the following format:

1. The first character of an end tag must be a U+003C LESS-THAN SIGN character (<).
2. The second character of an end tag must be a U+002F SOLIDUS character (/).
3. The next few characters of an end tag must be the element's tag name.
4. After the tag name, there may be one or more space characters.
5. Finally, end tags must be closed by a U+003E GREATER-THAN SIGN character (>).

and also per the XML specification: https://www.w3.org/TR/REC-xml/#NT-ETag

[42] ETag ::= '</' Name S? '>'

where:

[3] S ::= (#x20 | #x9 | #xD | #xA)+

it should be parsed normally, and we should see two elements from the example mentioned above.

If it is NBSP, the specification disallows it, and an exception should be thrown to warn the user that what looks like a space (0x20) is actually NBSP (0xA0). This is the expected behavior in my opinion.
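To illustrate the strict behavior argued for here, below is a minimal sketch in Python, independent of HtmlAgilityPack, that checks each end tag against the XML production '</' Name S? '>' and raises an exception otherwise. The function name `check_end_tags` is hypothetical, and the `Name` pattern is a simplified ASCII-only assumption rather than the full XML `Name` production.

```python
import re

# S ::= (#x20 | #x9 | #xD | #xA)+  (XML production [3]), written out
# explicitly. Python's \s is avoided because, for str patterns, it also
# matches U+00A0.
# The Name part is a simplified, ASCII-only stand-in for the XML Name production.
END_TAG_RE = re.compile(r"</[A-Za-z_:][A-Za-z0-9_:.\-]*[\x20\x09\x0D\x0A]*>")

# Anything that looks like an end tag, well-formed or not.
ANY_END_TAG_RE = re.compile(r"</[^<>]*>")


def check_end_tags(markup):
    """Raise ValueError for the first end tag not matching '</' Name S? '>'."""
    for match in ANY_END_TAG_RE.finditer(markup):
        tag = match.group(0)
        if END_TAG_RE.fullmatch(tag) is None:
            raise ValueError(
                f"malformed end tag {tag!r} at offset {match.start()}"
            )


# A real space after the name is accepted; U+00A0 is rejected.
for text in ("<c></c ><c></c>", "<c></c\u00a0><c></c>"):
    try:
        check_end_tags(text)
        print("ok:", repr(text))
    except ValueError as err:
        print("rejected:", err)
```

Such a check could run as a separate validation pass before lenient parsing, so that a data-oriented caller gets the warning while a browser-like caller can still choose to ignore it.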

I am not sure how Firefox and Chrome handle this. But there may be a subtle difference between a browser, which renders content that the user can then inspect, and a library, which treats the parsed document as data. That data might later be fed into a rendering process, where the UI would typically catch and suppress the exception in order to show the user as much of the content as possible.

wangyoutian avatar Dec 24 '24 16:12 wangyoutian