html-agility-pack

&nbsp; and other escape sequences are saved incorrectly when using XHTML mode

Open Mertsch opened this issue 3 years ago • 3 comments

1. Description

[TestMethod]
public void OptionOutputAsXmlBugTest()
{
    string html = @"Start|&nbsp;|&lt;|&gt;|&amp;|&euro;|&pound;|&quot;|&apos;|End";
    HtmlDocument htmlDocument = new HtmlDocument
        {
            OptionOutputAsXml = true,
        };
    htmlDocument.LoadHtml(html);
    StringWriter stringWriter = new StringWriter(new StringBuilder(html.Length + 1000), CultureInfo.InvariantCulture);
    htmlDocument.Save(stringWriter);
    Assert.AreEqual("<?xml version=\"1.0\" encoding=\"utf-8\"?>Start|&nbsp;|&lt;|&gt;|&amp;|&euro;|&pound;|&quot;|&apos;|End", stringWriter.ToString());
    //           Actual: <?xml version="1.0" encoding="utf-8"?>Start|&amp;nbsp;|&lt;|&gt;|&amp;|&amp;euro;|&amp;pound;|&quot;|&amp;apos;|End
}

As you can see, &nbsp; is saved as &amp;nbsp;. The same goes for some other HTML escape sequences (&euro;, &pound;, &apos;), but not all: &lt;, &gt;, &amp; and &quot; are left as they are 🤪

  • HAP version: 1.11.42
  • NET version: .NET 6.0.1

Mertsch avatar Feb 08 '22 08:02 Mertsch

Hello @Mertsch ,

This is expected, since this is how &nbsp; is escaped in XML: https://www.freeformatter.com/xml-escape.html

As discussed in https://github.com/zzzprojects/html-agility-pack/issues/456 there are indeed some changes we could make, but purely in XML terms this is the right behavior.

Best Regards,

Jon


JonathanMagnan avatar Feb 08 '22 13:02 JonathanMagnan

Hello @JonathanMagnan Thank you very much for your explanation and time.

I understand now that & characters in HTML need to be escaped for XML. But as your link suggests, shouldn't the output then be Start|&amp;nbsp;|&amp;lt;|&amp;gt;|&amp;amp;|&amp;euro;|&amp;pound;|&amp;quot;|&amp;apos;|End, with every & in the text escaped as &amp;?

I do not fully understand the linked issue #456. It seems there is a "backwards compatible" flag that specifically keeps &nbsp;, but if this is about XML escaping, why are only some &s escaped?
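
To make the "escape every &" expectation concrete, here is a minimal sketch. It does not use HAP at all; it assumes the .NET helper System.Security.SecurityElement.Escape as a stand-in for a plain XML text escaper, and the class name is only illustrative.

```csharp
using System;
using System.Security;

class EscapeEveryAmpersand
{
    static void Main()
    {
        // The entity list from the test's expected string, treated as literal text.
        string html = "Start|&nbsp;|&lt;|&gt;|&amp;|&euro;|&pound;|&quot;|&apos;|End";

        // SecurityElement.Escape replaces <, >, ", ' and & with their XML entities.
        // Of those, the input contains only &, so every & becomes &amp;.
        string escaped = SecurityElement.Escape(html);

        Console.WriteLine(escaped);
        // Prints: Start|&amp;nbsp;|&amp;lt;|&amp;gt;|&amp;amp;|&amp;euro;|&amp;pound;|&amp;quot;|&amp;apos;|End
    }
}
```

A uniform escaper like this treats all nine entities the same way, which is why the mixed result in the "Actual" line of the test looks inconsistent.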

Mertsch avatar Feb 09 '22 20:02 Mertsch

Hello @Mertsch ,

My bad, I just saw the part about &nbsp; in your initial post.

The OptionOutputAsXml option is currently very confusing. I will look at it more deeply.

Best Regards,

Jon

JonathanMagnan avatar Feb 10 '22 01:02 JonathanMagnan

I do not have this problem if I set HtmlDocument.BackwardCompatibility = false.

ghost avatar Dec 01 '23 17:12 ghost

I have chosen to go with https://github.com/AngleSharp/AngleSharp and this issue is no longer relevant to me. If you want to close it, feel free to do so.

Mertsch avatar Dec 01 '23 19:12 Mertsch

Hello @Mertsch ,

In that case, we will close this issue. AngleSharp is a great library, so I certainly understand your choice.

Best Regards,

Jon

JonathanMagnan avatar Dec 01 '23 21:12 JonathanMagnan