html-agility-pack
Huge memory consumption
Consider the following HTML: https://code.googlesource.com/mozc/+/HEAD/src/data/dictionary_oss/dictionary05.txt
When an HtmlDocument is created for it, the process consumes 3 GB of memory. Is it possible to reduce this?
Hello @aktzbn ,
We will look at it later this week to see what consumes so much memory.
The table has over 150k rows, but that should probably not consume this amount of memory.
Best Regards,
Jonathan
Hello @aktzbn ,
Unfortunately, I don't think it's currently possible to do anything.
Your web page has 80 MB of data in a table format. More than 2 million HtmlNode objects need to be created.
I don't think there is much we can currently do. Even if we optimize the code a little, it will still consume a few GB of memory.
Even if no fix is provided, let me know if this answers your issue.
Best Regards,
Jonathan
Hello @JonathanMagnan,
Thank you for the response. I see.
1K per node is a pretty big overhead. I would suggest considering structs instead of classes for the most numerous objects, and removing information that is required only (I think) at the parsing stage (line numbers, positions, and so on).
Best Regards,
Boris
Hello @aktzbn ,
Yes, it's a pretty big overhead, but a class instance must also be created for every CSS class and attribute.
Overall:
- 3976300 HtmlNode
- 3485104 HtmlAttributeCollection
- 3485104 Dictionary<string, HtmlAttribute>
- 3485104 List<HtmlAttribute>
- 2479370 Attribute
- 1496948 HtmlTextNode
Etc. That's a lot of stuff to keep in memory.
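As a rough sanity check on the "1K per node" figure, the counts above can be set against the reported 3 GB. This is only back-of-the-envelope arithmetic, written in Python for convenience. It assumes that "3Gb" means 3 GiB of process memory, that all of it is attributable to the parsed document, and that the six listed types are the only ones counted (the "Etc." says they are not), so the results are rough averages rather than measurements.

```python
# Object counts as reported in the list above.
counts = {
    "HtmlNode": 3_976_300,
    "HtmlAttributeCollection": 3_485_104,
    "Dictionary<string, HtmlAttribute>": 3_485_104,
    "List<HtmlAttribute>": 3_485_104,
    "Attribute": 2_479_370,
    "HtmlTextNode": 1_496_948,
}

# Assumption: the reported "3Gb" is 3 GiB, all of it spent on the document.
total_bytes = 3 * 1024**3

total_objects = sum(counts.values())

# Average cost if everything is charged to HtmlNode instances.
bytes_per_html_node = total_bytes / counts["HtmlNode"]

# Average cost spread over every listed object.
bytes_per_object = total_bytes / total_objects

print(f"listed objects:          {total_objects:,}")
print(f"bytes per HtmlNode:      {bytes_per_html_node:.0f}")
print(f"bytes per listed object: {bytes_per_object:.0f}")
```

Under those assumptions the list adds up to roughly 18 million objects, and charging all the memory to the HtmlNode instances gives on the order of 800 bytes each, which is in line with the "1K per node" estimate.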
There are several things that could be improved to reduce memory usage, in addition to your suggestion.
For example, adding options that hint to the parser whether classes and attributes should be created as objects or only parsed as text.
Or only creating the dictionary and list on demand, not by default.
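The "on demand" idea can be illustrated with a minimal, language-neutral sketch. The Python below is not HtmlAgilityPack code and does not reflect its API. The `Node` class, its `attributes` property, and `has_attributes` are hypothetical names used only to show lazy allocation: the attribute container is created the first time someone asks for it, so nodes that never carry attributes (text nodes, for instance) never pay for one.

```python
class Node:
    """Hypothetical DOM node that allocates its attribute storage lazily."""

    # __slots__ avoids a per-instance __dict__, trimming per-node overhead further.
    __slots__ = ("name", "_attributes")

    def __init__(self, name):
        self.name = name
        self._attributes = None  # nothing is allocated until first use

    @property
    def has_attributes(self):
        # Checking for attributes must not trigger the allocation.
        return bool(self._attributes)

    @property
    def attributes(self):
        # Create the dictionary the first time a caller asks for it.
        if self._attributes is None:
            self._attributes = {}
        return self._attributes


# A text node never touches its attributes, so no dictionary is created for it.
text_node = Node("#text")

# An element node allocates its dictionary on first access.
cell = Node("td")
cell.attributes["id"] = "1"

print(text_node.has_attributes, text_node._attributes is None)
print(cell.has_attributes, cell.attributes)
```

The trade-off is an extra null check on every access, in exchange for skipping one or more allocations per node that never uses them.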
Here is an example of one row in the page you are trying to parse:
<tr class="u-pre u-monospace FileContents-line">
<td class="u-lineNum u-noSelect FileContents-lineNum" data-line-number="1" onclick="window.location.hash='#1'"></td>
<td class="FileContents-lineContents" id="1"><span class="pun">ひとつぼ</span><span class="pln"> </span><span class="lit">2520</span><span class="pln"> </span><span class="lit">1773</span><span class="pln"> </span><span class="lit">7132</span><span class="pln"> </span><span class="pun">ひとつぼ</span></td>
</tr>
That's definitely something we will work on and take into consideration for version 2.x.
Best Regards,
Jonathan
Do you have any idea when 2.x will be available? :)
Hello @aktzbn ,
Unfortunately, not yet ;( I do not believe it will be ready in 2017.
We have so many projects ongoing currently.
Best Regards,
Jonathan
Has this issue been worked on? I'm attempting to load a large Html file and I consistently get an OutOfMemoryException.
I realize the Html file is huge, but given the restrictions that I have for what I'm working on, I can't modify the file in any way but I need to parse it out to store the information in a database. Chrome also won't load the file, so I'm really not sure if there's anything that I can do in this particular instance.
Any suggestions on what I should do?
The exception information is:
Exception of type 'System.OutOfMemoryException' was thrown.
   at HtmlAgilityPack.HtmlNodeCollection.System.Collections.Generic.IEnumerable<HtmlAgilityPack.HtmlNode>.GetEnumerator() in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNodeCollection.cs:line 168
   at HtmlAgilityPack.HtmlNode.SetChildNodesId(HtmlNode chilNode) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNode.cs:line 918
   at HtmlAgilityPack.HtmlNode.AppendChild(HtmlNode newChild) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlNode.cs:line 908
   at HtmlAgilityPack.HtmlDocument.PushNodeEnd(Int32 index, Boolean close) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlDocument.cs:line 2000
   at HtmlAgilityPack.HtmlDocument.Parse() in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlDocument.cs:line 1391
   at HtmlAgilityPack.HtmlDocument.Load(TextReader reader) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlDocument.cs:line 771
   at HtmlAgilityPack.HtmlDocument.Load(Stream stream) in C:\Users\Jonathan\source\repos\HtmlAgilityPack\HtmlAgilityPack.Shared\HtmlDocument.cs:line 672
   at xxx.Utility.ZipUtility.GenerateParsedHtml(String Pathx, List`1 physicalHtmlFiles, String InnerZip) in C:\code\xxx\xxx\xxx\Utility\ZipUtility.cs:line 180
Hello @alnewman ,
We made improvements in some areas, but sometimes there is nothing we can do unless we fully rewrite the library to handle this better.
For example, in this case there were too many nodes and attributes. Currently, a node is created every time instead of being created "on demand".