Support HTML5 entities

Open leoshusar opened this issue 4 years ago • 2 comments

Hi! Would it be possible to add support for HTML5 entities? The .NET team dropped the PR because the change is not backwards compatible and there was little interest, so they decided not to update it yet.

A few examples I have run into today are &excl; &lpar; &rpar; &comma; ...

leoshusar avatar Dec 06 '21 01:12 leoshusar

Hello @leoshusar ,

Just to make sure, what exactly is the behavior you are looking for? Could you show us an example?

I know there is already some stuff that we support in this part.

See:

  • https://github.com/zzzprojects/html-agility-pack/blob/master/src/HtmlAgilityPack.Shared/HtmlEntity.cs#L54
  • https://github.com/zzzprojects/html-agility-pack/blob/c41452a1ebd2f7549767b4924596cccc3eca8ded/src/HtmlAgilityPack.Shared/HtmlAttribute.cs#L229

But we may indeed not support what you are looking for; that is the part of your request I'm not sure about.

Best Regards,

Jon


JonathanMagnan avatar Dec 06 '21 13:12 JonathanMagnan

Hi, @JonathanMagnan,

For example, take this string: {[()]},!@"€#&~ˇ^˘°=; When you encode it with e.g. this website, you get this fully encoded string:

{[()]},!@"€#&~ˇ^˘°=;

and these are the outputs when you try to decode it in C#:

HttpUtility.HtmlDecode: {[()]},!@"?#&~ˇ^˘°=;
HtmlEntity.DeEntitize:  {[()]},!@"?#&~ˇ^˘°=;

because neither of these decoders has HTML5 support. Here is the W3 spec with all the HTML5 characters; there are 2231 of them :) But there are some differences between HTML4 and HTML5 (noted here), for example:

The &lang; and &rang; named character references now expand to U+27E8 and U+27E9 (mathematical left/right angle bracket) instead of U+2329 and U+232A (left/right-pointing angle bracket), respectively.

so the DeEntitizer cannot simply be updated with the new characters without changing existing output. That is also the reason why the PR was not merged in dotnet.
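
To illustrate the compatibility concern, here is a rough sketch of how HTML5 names could be made opt-in instead of replacing the existing table. Everything in it is an assumption for illustration only: the class `Html5EntitySketch`, the `html5` flag, and the six-entry table are hypothetical and are not part of HtmlAgilityPack or .NET. Only `WebUtility.HtmlDecode` and `Regex` are real APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not an existing HtmlAgilityPack or .NET type.
public static class Html5EntitySketch
{
    // Tiny illustrative subset of the HTML5 named character references.
    // A real table would hold all of them; none of these values is "&",
    // so the second decoding pass below cannot double-decode anything.
    private static readonly Dictionary<string, string> Html5Names =
        new Dictionary<string, string>(StringComparer.Ordinal)
        {
            ["excl"] = "!",
            ["lpar"] = "(",
            ["rpar"] = ")",
            ["comma"] = ",",
            ["lang"] = "\u27E8", // HTML5 mapping (HTML4 used U+2329)
            ["rang"] = "\u27E9", // HTML5 mapping (HTML4 used U+232A)
        };

    // Matches references of the form &name; (entity names are case-sensitive).
    private static readonly Regex NamedReference =
        new Regex("&([A-Za-z][A-Za-z0-9]*);", RegexOptions.Compiled);

    // With html5 == false the result is whatever WebUtility.HtmlDecode returns,
    // so existing callers see no change in behavior.
    public static string Decode(string text, bool html5 = false)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        if (html5)
        {
            // Resolve HTML5 names first so they take precedence over the
            // built-in mappings; names missing from the table stay untouched.
            text = NamedReference.Replace(
                text,
                m => Html5Names.TryGetValue(m.Groups[1].Value, out var value)
                    ? value
                    : m.Value);
        }

        // The built-in decoder handles everything else (numeric references,
        // &amp;, &quot;, ...).
        return WebUtility.HtmlDecode(text);
    }
}
```

Calling `Html5EntitySketch.Decode(input, html5: true)` would then resolve the new names, while the default call leaves current output alone. Whether a flag, a separate method, or something else is the right shape for the library is an open design question.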

leoshusar avatar Dec 06 '21 16:12 leoshusar