SmartReader
ConvertToPlaintext performance enhancements
The methods ConvertToPlaintext and ConvertToText could do with a few small performance improvements. We found a few HTML pages where those methods would take 10 minutes to run. After some profiling, these are the changes we made (the execution time went from 10 minutes to milliseconds):
ConvertToText
- Use StringBuilder instead of StringWriter, the latter is more resource hungry
- Don't return a value (StringBuilder.ToString()), make the function void, keep the same instance of the StringBuilder during the recursion
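As a rough illustration of the two points above, here is a minimal sketch of a void recursion that appends to one shared StringBuilder. `SimpleNode`, `TextExtractor` and `AppendText` are hypothetical names made up for this example; they are not SmartReader or AngleSharp types, and the real method walks AngleSharp nodes instead.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a parsed HTML node (not an AngleSharp type).
public sealed class SimpleNode
{
    public string Text { get; set; } = "";
    public List<SimpleNode> Children { get; } = new List<SimpleNode>();
}

public static class TextExtractor
{
    // Void recursion: every level appends to the same StringBuilder,
    // so no intermediate strings are built on the way back up.
    public static void AppendText(SimpleNode node, StringBuilder output)
    {
        output.Append(node.Text);
        foreach (var child in node.Children)
            AppendText(child, output);
    }

    // Public entry point: ToString() is called once, at the very end.
    public static string ConvertToText(SimpleNode root)
    {
        var output = new StringBuilder();
        AppendText(root, output);
        return output.ToString();
    }
}
```

The only allocation that grows with the size of the document is the single StringBuilder buffer; the recursion itself returns nothing.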
ConvertToPlaintext
- Don't invoke the Remove method on the second StringBuilder, this method is very expensive, instead append character by character to the new StringBuilder if the character meets the conditions you already have
- Pre-compile and re-use the regular expressions
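For the regular expression point, a minimal sketch of what pre-compiling and re-using could look like. The pattern and the names `PlaintextCleanup`, `ExtraNewlines` and `CollapseBlankLines` are assumptions chosen for illustration; the expressions the library actually uses may differ.

```csharp
using System.Text.RegularExpressions;

public static class PlaintextCleanup
{
    // Hypothetical pattern: three or more consecutive newlines.
    // The static readonly field is built once and shared by every call,
    // instead of constructing a new Regex on each invocation.
    private static readonly Regex ExtraNewlines =
        new Regex(@"\n{3,}", RegexOptions.Compiled);

    // Collapses runs of three or more newlines into exactly two.
    public static string CollapseBlankLines(string text)
    {
        return ExtraNewlines.Replace(text, "\n\n");
    }
}
```

Regex instances are safe to share across threads for matching, so one static field also works when requests run in parallel.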
Thanks for the library!
Thanks for your feedback. I will work on it. I understand the issue, 10 minutes is a lot of time. Could you indicate which pages took so long to convert to text, so we could have a test case?
You are welcome! See below the URLs we were having problems with. The slowdown would only happen when running the requests in parallel inside an ASP.NET app using TPL; for some reason the StringWriter class would use large amounts of memory and cause thread pool starvation. If we requested the URLs individually, that problem wouldn't happen. Simply changing from a StringWriter to a StringBuilder solved the main problem; the other changes are minor enhancements.
https://www.heconomia.es/volatil.asp?o=1513041124
https://www.heconomia.es/volatil.asp?o=1625032559
https://www.heconomia.es/volatil.asp?o=-1348189962
https://www.heconomia.es/volatil.asp?o=1799106153
https://www.heconomia.es/volatil.asp?o=1888223065
https://www.heconomia.es/volatil.asp?o=1678194698
https://heconomia.es/volatil.asp?o=1680517915
https://www.heconomia.es/volatil.asp?o=1680517915
https://heconomia.es/volatil.asp?o=1437850236
I hope this helps!
By the way, this is the logic we use to append to a new StringBuilder instead of removing from the existing one. I think it is equivalent to yours, but take a look:
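    var stringBuilder = new StringBuilder();
    // index, previousSpace and previousNewline come from the surrounding method;
    // the initial values below are assumed
    int index = 0;
    bool previousSpace = false, previousNewline = false;
    while (index < text.Length)
    {
        var c = text[index];
        // carriage return and line feed are not separator characters
        bool isSpace = char.IsSeparator(c);
        bool isNewline = c is '\r' or '\n';
        if (isSpace)
            c = ' ';
        else if (isNewline)
            c = '\n';
        // check the bounds first so a trailing space cannot index past the end of text
        bool nextIsNewline = index + 1 < text.Length && text[index + 1] is ('\r' or '\n');
        if (!(previousNewline && isSpace) && !(previousSpace && isSpace) && !(isSpace && nextIsNewline))
            stringBuilder.Append(c);
        index++;
        previousSpace = isSpace;
        previousNewline = isNewline;
    }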
Thanks for providing the additional context. I implemented your suggestions. I am not sure what is causing the problems, since from what I understand StringWriter is just a StringBuilder with some additional methods.
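The relationship between the two classes can be checked with a small sketch: a StringWriter writes into an underlying StringBuilder, which `GetStringBuilder()` exposes. `StringWriterDemo` and its members are hypothetical names used only for this example, and the sketch does not explain the memory behavior reported above.

```csharp
using System.IO;
using System.Text;

public static class StringWriterDemo
{
    // Writes through a StringWriter and returns what landed in the builder.
    public static string WriteThroughWriter(StringBuilder builder)
    {
        using (var writer = new StringWriter(builder))
        {
            writer.Write("Hello, ");
            writer.Write("world");
        }
        return builder.ToString();
    }

    // True when the writer hands back the very builder it was given.
    public static bool SharesBuilder()
    {
        var builder = new StringBuilder();
        using (var writer = new StringWriter(builder))
        {
            return ReferenceEquals(writer.GetStringBuilder(), builder);
        }
    }
}
```

Everything written through the writer ends up in the builder, which is consistent with the view that StringWriter is a thin TextWriter layer over a StringBuilder.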
Your logic is mostly equivalent, but it normalizes newlines to \n, which is not what is happening right now. To be fair, testing the changes, I think that internally AngleSharp (the HTML parsing library we use) already does that, so that would not be a big deal. However, it is slightly different from our current code.