html-agility-pack
Accessing DocumentNode.OuterHtml Causes Stack Overflow Exception On Demand
I've encountered a strange situation with the HTML source from http://portalamis.org.br/?secao=noticias. See the raw HTML in the attached file: http-portalamis.org.br-secao-noticias.html.txt
Here's my code:
public HtmlAgilityPack.HtmlDocument document { get; private set; }
....
....
encoding = Encoding.UTF8;
this.document = new HtmlAgilityPack.HtmlDocument();
this.document.OptionFixNestedTags = true;
this.document.OptionAutoCloseOnEnd = true;
this.document.OptionDefaultStreamEncoding = encoding;
this.document.LoadHtml(htmlContent);
Then simply accessing
this.document.DocumentNode.OuterHtml
causes a stack overflow on demand.
Hello @blankers ,
Thank you for reporting,
We will look at this issue soon.
Best Regards,
Jonathan
Hello @blankers ,
Just to let you know, we recently took some time to investigate this, but unfortunately we have not been able to find the cause.
We will try to investigate it again once my new developer is more comfortable with this library.
Best Regards,
Jonathan
Please find attached a project that also exhibits a stack overflow when run, on getting the OuterHtml of a node. The code in the project reads some HTML, modifies it a bit, then tries to access the OuterHtml of the document node. I have not taken the time to investigate whether the modifications are a necessary part of reproducing the problem.
When the relevant code is run in the context of an ASP.NET Core web site, different behaviour is observed. If the code is running under the debugger, the debugger closes with no user interaction. Setting a breakpoint at the line that accesses the OuterHtml getter and mousing over it causes a popup to appear as seen in the screengrab. Googling the error code 0xc0000005, it appears to mean that an access violation occurred.
Further to the above - the failure is not seen (in either of its forms) if the line node.Attributes.RemoveAll(); is commented out.
Workaround
// Requires: using System.Linq; (for Select and ToArray)
private static void RemoveAllAttributes(HtmlNode node)
{
    // We should be able to do this:
    //     node.Attributes.RemoveAll();
    // But there is a bug, see https://github.com/zzzprojects/html-agility-pack/issues/103
    // Snapshot the names first so the collection is not modified while it is enumerated.
    var attributeNames = node.Attributes.Select(attr => attr.Name).ToArray();
    foreach (string attrName in attributeNames)
    {
        node.Attributes.Remove(attrName);
    }
}
It's an old issue, but I also hit it. I'm debugging the code now and can reproduce the stack overflow (read the call stack from bottom to top):
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.OuterHtml.get() Line 660 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlTextNode.Text.get() Line 67 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 1984 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo() Line 2145 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.UpdateHtml() Line 2183 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.OuterHtml.get() Line 663 C#
[The 5 frame(s) above this were repeated 1282 times]
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo() Line 2145 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.UpdateHtml() Line 2183 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.OuterHtml.get() Line 663 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlTextNode.Text.get() Line 67 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 1984 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo(System.IO.TextWriter outText, int level) Line 1881 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 2034 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo(System.IO.TextWriter outText, int level) Line 1881 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 2034 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo(System.IO.TextWriter outText, int level) Line 1881 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteTo(System.IO.TextWriter outText, int level) Line 2034 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo(System.IO.TextWriter outText, int level) Line 1881 C#
> HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.WriteContentTo() Line 1892 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.UpdateHtml() Line 2182 C#
HtmlAgilityPack.dll!HtmlAgilityPack.HtmlNode.OuterHtml.get() Line 663 C#
HSPI_AKWeather.exe!HSPI_AKWeather.HtmlGeneratorWeather.WriteHtmlFile() Line 143 C#
Further investigation (I don't fully understand the code yet): when _text == null, the HtmlTextNode.Text getter calls base.OuterHtml, which leads to the infinite recursion seen in the call stack above:
/// <summary>
/// Gets or Sets the text of the node.
/// </summary>
public string Text
{
    get
    {
        if (_text == null)
        {
            return base.OuterHtml;
        }
        return _text;
    }
    set
    {
        _text = value;
        SetChanged();
    }
}
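To make the cycle in the call stack easier to follow, here is a minimal, self-contained sketch of the same shape of recursion. Everything in it (ToyNode, ToyTextNode, ToyDemo, MaxDepth) is hypothetical and written for illustration; it is not HtmlAgilityPack's source, and it simply assumes a node that is marked as changed while _text is null. How that state arises in the real library is not established in this thread. A depth guard turns the would-be stack overflow into an exception so the demo can run safely.

```csharp
using System;
using System.IO;

// Toy model only: hypothetical classes, NOT HtmlAgilityPack's implementation.
public class ToyNode
{
    public const int MaxDepth = 50; // guard: throw instead of overflowing the stack
    public static int Depth;

    protected string _outerHtml;
    protected bool _changed = true;

    public virtual string OuterHtml
    {
        get
        {
            if (++Depth > MaxDepth)
            {
                throw new InvalidOperationException("OuterHtml re-entered too many times");
            }
            try
            {
                if (_changed)
                {
                    UpdateHtml(); // mirrors OuterHtml.get -> UpdateHtml in the call stack
                }
                return _outerHtml;
            }
            finally
            {
                Depth--;
            }
        }
    }

    protected void UpdateHtml()
    {
        var writer = new StringWriter();
        WriteTo(writer);          // UpdateHtml -> WriteTo
        _outerHtml = writer.ToString();
        _changed = false;         // only reached if WriteTo returns
    }

    public virtual void WriteTo(TextWriter outText) { }

    public void SetChanged() { _changed = true; }
}

public class ToyTextNode : ToyNode
{
    private string _text;

    public string Text
    {
        // Same shape as the getter quoted above: fall back to OuterHtml when _text is null.
        get { return _text ?? base.OuterHtml; }
        set { _text = value; SetChanged(); }
    }

    public override void WriteTo(TextWriter outText)
    {
        outText.Write(Text);      // WriteTo -> Text.get -> OuterHtml.get -> ...
    }
}

public static class ToyDemo
{
    // Returns true if reading OuterHtml hit the recursion guard.
    public static bool Recurses(string text)
    {
        ToyNode.Depth = 0;
        var node = new ToyTextNode();
        if (text != null)
        {
            node.Text = text;
        }
        try
        {
            string html = node.OuterHtml;
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }
}
```

In this model, ToyDemo.Recurses(null) hits the guard because the changed flag is still set when the Text getter re-enters OuterHtml, while ToyDemo.Recurses("hi") completes normally because _text is available and the getter never falls back.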
More info. The problem happens if I:
- call htmlDoc.LoadHtml(html),
- save the HTML once (call htmlDoc.DocumentNode.OuterHtml),
- then call some SetAttributeValue() again,
- then save the HTML again (call htmlDoc.DocumentNode.OuterHtml).
Hello @alexbk66 ,
Do you think you could reproduce the issue in a Fiddle? I'm not sure whether it will get fixed, but we can certainly look at it.
Here is a working fiddle with your example: https://dotnetfiddle.net/ImPNc1
Best Regards,
Jon
Hi Jon,
I copied my HTML https://dotnetfiddle.net/LQ5nAB
It doesn't stack overflow there, but it still fails because of the
System.NullReferenceException: Object reference not set to an instance of an object. at Program.Main() in d:\Windows\Temp\xnyzzw5v.0.cs:line 261
In Visual Studio, the stack overflow also happens at this tag.
But if I add spaces around the tag, then it works.
I'll try adding spaces in my code to see if it works and report back later.
< style id = ""gwd-text-style"" >
Hello @alexbk66 ,
It currently fails because the end tag is badly formatted: </ style >
contains a space, so to HAP it looks as if no label exists.
Therefore the following line, var node = htmlDoc.GetElementbyId("label_14");
sets node to null, which throws the null reference exception on this line: var tn1 = (HtmlTextNode)node.SelectSingleNode("text()");
I will wait for your investigation to reproduce the stack overflow issue.
The spaces were inserted by .NET Fiddle for some reason when I copied the code. You are right though, if I remove the spaces - it works.