html-agility-pack icon indicating copy to clipboard operation
html-agility-pack copied to clipboard

Stack Overflow loading html document

Open johnkarbonhq opened this issue 2 years ago • 3 comments

Here is what to include in your request to make sure we implement a solution as quickly as possible.

1. Description

Parsing html emails (which are often not well formed), I got a stack overflow. Setting HtmlDocument.MaxDepthLevel did not prevent the stack overflow.

2. Exception

If you are seeing an exception, include the full exception details (message and stack trace).

Html.SetChanged... called recursively 5,000 times

3. Fiddle or Project

Repro is here: https://dotnetfiddle.net/mKe6jk

johnkarbonhq avatar May 31 '23 03:05 johnkarbonhq

Hello @johnkarbonhq ,

Thank you for reporting,

My developer will look at it.

Best Regards,

Jon

JonathanMagnan avatar May 31 '23 14:05 JonathanMagnan

Hello @johnkarbonhq ,

Which version are you using?

When my developer tried it, he successfully got the error message raised on this line:

internal void CloseNode(HtmlNode endnode, int level = 0)
{
	if (level > HtmlDocument.MaxDepthLevel)
	{
		throw new ArgumentException(HtmlNode.DepthLevelExceptionMessage);
	}

Perharp is there something we are missing to reproduce it?

Best Regards,

Jon

JonathanMagnan avatar Jun 07 '23 00:06 JonathanMagnan

I suspect the real code of @johnkarbonhq already being quite deep in the call stack when calling HtmlDocument.LoadHtml, thus resulting in the call stack becoming exhausted. With a simple console application that more or less directly calls HtmlDocument.LoadHtml this is probably not observable simply because the call stack is pretty much mostly empty and doesn't overflow.

However, it's still just my speculation i am unable to either prove or disprove because i can't analyse the call stack at the time of the observed Stackoverflow exception (only @johnkarbonhq can do that).

If my suspicion is correct, then reducing the value HtmlDocument.MaxDepthLevel further (perhaps like HtmlDocument.MaxDepthLevel = 50;) could avoid the observed stack overflow. And if that's the case, the suggested patch/fix isn't going to really fix the problem.

Why? Because if and at which internal HtmlAgilityPack method call a stack overflow happens is dependent on how full the call stack already is. Quite possibly, right now with the suggested patch @johnkarbonhq's application code will avoid the stack overflow using HtmlDocument.MaxDepthLevel = 100;. But it is possible that a (later) change to the application code or simply a different control flow in the application (for example when the control flow depends on external inputs/data/other factors) would lead to the call stack being filled just a little more before calling HtmlDocument.LoadHtml and where setting HtmlDocument.MaxDepthLevel = 100; would then not prevent a stackoverflow exception anymore. Due to its reliance on a very specific and particular assumption about the call stack usage of the application that's calling HtmlDocument.LoadHtml, the suggested fix/patch appears to me as a very brittle band-aid for the observed problem

Ideal solution -- and also most expensive, sadly -- would be for the library to avoid deep recursion and converting potentially deep recursions into iterations or loops wherever practically feasible. For the SetChanged() method, this could for example look like this without losing much readability and maintainability (Warning! Untested code!):

internal void SetChanged()
{
	var currentNode = this;
	do
	{
		currentNode._changed = true;
		currentNode = currentNode.ParentNode;
	}
	while (currentNode is not null);
}

And a HtmlDocument.MaxDepthLevel check should then apply only to the remaining potentially deep recursions related to document traversal and to other functions which have to enforce HtmlDocument.MaxDepthLevel.

And in case the application code itself keeps eating up the call stack like there is no tomorrow where even the best and most thorough recursion optimization in HtmlAgilityPack wouldn't completely prevent call stack exhaustion, then the solution lies with the application: Either reduce its call stack usage or (as a workaround) execute call stack-intensive code in individual background thread (as each thread has its own dedicated call stack).

elgonzo avatar Jun 07 '23 11:06 elgonzo