
Huge memory consumption

Open aktzbn opened this issue 7 years ago • 8 comments

Consider the following HTML: https://code.googlesource.com/mozc/+/HEAD/src/data/dictionary_oss/dictionary05.txt

When an HtmlDocument is created for it, the process consumes 3 GB of memory. Is it possible to reduce this?

aktzbn avatar Sep 18 '17 08:09 aktzbn

Hello @aktzbn ,

We will look at it later this week to see what consumes so much memory.

The table has over 150k rows, but that should probably not consume this much memory.

Best Regards,

Jonathan

JonathanMagnan avatar Sep 18 '17 12:09 JonathanMagnan

Hello @aktzbn ,

Unfortunately, I don't think anything can be done about it at the moment.

Your web page has 80 MB of data in table format. More than 2 million HtmlNode objects need to be created.

I don't think there is much we can currently do. Even if we optimize the code a little, it will still consume a few GB of memory.

Even though no fix is provided, let me know if this answers your question.

Best Regards,

Jonathan

JonathanMagnan avatar Sep 19 '17 01:09 JonathanMagnan

Hello @JonathanMagnan,

Thank you for the response. I see.
1K per node is a pretty big overhead. I would suggest considering structs instead of classes for the most numerous objects, and removing information that is (I think) required only at the parsing stage, such as line numbers, positions and so on.

Best Regards,

Boris

aktzbn avatar Sep 19 '17 05:09 aktzbn

Hello @aktzbn ,

Yes, it's a pretty big overhead, but an object must also be created for every CSS class and attribute.

Overall:

  • 3976300 HtmlNode
  • 3485104 HtmlAttributeCollection
  • 3485104 Dictionary<string, HtmlAttribute>
  • 3485104 List<HtmlAttribute>
  • 2479370 Attribute
  • 1496948 HtmlTextNode

Etc. That's a lot to keep in memory.
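
As a rough check on the "1K per node" estimate, the figures quoted in this thread can be combined as in the sketch below. It is written in Python purely as a calculator and is not part of HtmlAgilityPack. It assumes that "3Gb" means 3 GiB of total process memory and that all of it is attributable to the parsed document, which overstates the per-object cost. The listed types may also overlap (for example, text nodes may already be included in the HtmlNode count), so the totals are approximate.

```python
# Object counts quoted above (only the six listed types; "Etc." is not included).
counts = {
    "HtmlNode": 3_976_300,
    "HtmlAttributeCollection": 3_485_104,
    "Dictionary<string, HtmlAttribute>": 3_485_104,
    "List<HtmlAttribute>": 3_485_104,
    "Attribute": 2_479_370,
    "HtmlTextNode": 1_496_948,
}

# Assumption: "3Gb" is 3 GiB and is entirely due to the parsed document.
total_bytes = 3 * 2**30

total_objects = sum(counts.values())
bytes_per_node = total_bytes / counts["HtmlNode"]
bytes_per_object = total_bytes / total_objects

print(f"{total_objects:,} objects listed")
print(f"about {bytes_per_node:.0f} bytes per HtmlNode")
print(f"about {bytes_per_object:.0f} bytes per listed object")
```

Under these assumptions the result is on the order of 800 bytes per HtmlNode, which is in line with the "1K per node" remark, and well under 200 bytes per listed object once the supporting collections are counted separately.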

There are several things that could be improved to reduce memory usage, in addition to your suggestion.

For example, adding options that hint to the parser whether classes and attributes should be created as objects or only parsed as text.

Also, creating the dictionary and list only on demand, not by default.
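
The "on demand" idea can be illustrated with a small, language-neutral sketch. The LazyNode class below is hypothetical and written in Python for brevity; it is not HtmlAgilityPack code and does not reflect how the library is implemented. It only shows the pattern: a node starts without any attribute containers and allocates its dictionary and list the first time an attribute is actually set.

```python
class LazyNode:
    """Hypothetical node that allocates attribute storage only when needed."""

    __slots__ = ("name", "_attr_index", "_attr_order")

    def __init__(self, name):
        self.name = name
        # No dictionary or list is created up front.
        self._attr_index = None
        self._attr_order = None

    def set_attribute(self, key, value):
        # First write: allocate both containers.
        if self._attr_index is None:
            self._attr_index = {}
            self._attr_order = []
        if key not in self._attr_index:
            self._attr_order.append(key)
        self._attr_index[key] = value

    def get_attribute(self, key, default=None):
        # Reads never allocate anything.
        if self._attr_index is None:
            return default
        return self._attr_index.get(key, default)

    @property
    def has_attribute_storage(self):
        return self._attr_index is not None


# A <span class="pun"> needs storage; a plain text node never does.
span = LazyNode("span")
span.set_attribute("class", "pun")
text = LazyNode("#text")

print(span.get_attribute("class"), span.has_attribute_storage)
print(text.get_attribute("class"), text.has_attribute_storage)
```

With a layout like this, nodes that carry no attributes (such as text nodes) would hold two empty references instead of two allocated containers. How much that would save in the real library is not something this sketch can measure.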

Here is an example of one row in the page you are trying to parse:

<tr class="u-pre u-monospace FileContents-line">
	<td class="u-lineNum u-noSelect FileContents-lineNum" data-line-number="1" onclick="window.location.hash='#1'"></td>
	<td class="FileContents-lineContents" id="1"><span class="pun">ひとつぼ</span><span class="pln">	</span><span class="lit">2520</span><span class="pln">	</span><span class="lit">1773</span><span class="pln">	</span><span class="lit">7132</span><span class="pln">	</span><span class="pun">ひとつぼ</span></td>
</tr>

That's definitely something we will work on and take into consideration for version 2.x.

Best Regards,

Jonathan

JonathanMagnan avatar Sep 19 '17 14:09 JonathanMagnan

Do you have any idea when 2.x will be available? :)

aktzbn avatar Sep 20 '17 11:09 aktzbn

Hello @aktzbn ,

Unfortunately, not yet ;( I do not believe it will be ready in 2017.

So many projects are ongoing at the moment.

Best Regards,

Jonathan

JonathanMagnan avatar Sep 20 '17 12:09 JonathanMagnan

Has this issue been worked on? I'm attempting to load a large HTML file and I consistently get an OutOfMemoryException.

I realize the HTML file is huge, but given the restrictions on what I'm working on, I can't modify the file in any way, and I need to parse it to store the information in a database. Chrome also won't load the file, so I'm really not sure if there's anything I can do in this particular instance.

Any suggestions on what I should do?

The exception information is:

Exception of type 'System.OutOfMemoryException' was thrown.
   at HtmlAgilityPack.HtmlNodeCollection.System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>.GetEnumerator() in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNodeCollection.cs:line 168
   at HtmlAgilityPack.HtmlNode.SetChildNodesId(HtmlNode chilNode) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNode.cs:line 918
   at HtmlAgilityPack.HtmlNode.AppendChild(HtmlNode newChild) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNode.cs:line 908
   at HtmlAgilityPack.HtmlDocument.PushNodeEnd(Int32 index, Boolean close) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlDocument.cs:line 2000
   at HtmlAgilityPack.HtmlDocument.Parse() in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlDocument.cs:line 1391
   at HtmlAgilityPack.HtmlDocument.Load(TextReader reader) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlDocument.cs:line 771
   at HtmlAgilityPack.HtmlDocument.Load(Stream stream) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlDocument.cs:line 672
   at xxx.Utility.ZipUtility.GenerateParsedHtml(String Pathx, List`1 physicalHtmlFiles, String InnerZip) in C:\code\xxx\xxx\xxx\Utility\ZipUtility.cs:line 180

alnewman avatar Jun 26 '19 19:06 alnewman

Hello @alnewman ,

We have made improvements in some areas, but sometimes there is nothing we can do unless we fully rewrite the library to handle this better.

For example, in this case there were too many nodes and attributes. Currently, a node is always created up front instead of being created "on demand".

JonathanMagnan avatar Jun 27 '19 02:06 JonathanMagnan