SmartReader ConvertToPlaintext performance enhancements

The methods ConvertToPlaintext and ConvertToText can do with a few small performance improvements. We found a few HTML pages where those methods would take 10 minutes to run, and after doing some profiling these are the changes we made (the execution went form 10 minutes to milliseconds):

ConvertToText

Use StringBuilder instead of StringWriter, the latter is more resource hungry
Don't return a value (StringBuilder.ToString()), make the function void, keep the same instance of the StringBuilder during the recursion

ConvertToPlaintext

Don't invoke the Remove method on the second StringBuilder, this method is very expensive, instead append character by character to the new StringBuilder if the character meets the conditions you already have
Pre-compile and re-use the regular expressions

Thanks for the library!

Apr 27 '24 21:04 malv007

Thanks for your feedback. I will work on it. I understand the issue, 10 minutes is a lot of time. Could you indicate which pages took so much to convert to text, so we could have a test case?

Apr 28 '24 14:04 gabriele-tomassetti

You are welcome! See below the URLs we were having problems with, but the slowdown would only happen when running the requests in parallel inside an ASP.NET app using TPL,for some reason the StringWriter class would use large amounts of memory and cause a thread pool starvation. But if we requested the URLs individually, that problem wouldn't happen. And simply changing from a StringWriter to a StringBuilder solved the main problem, the other changes are minor enhancements.

https://www.heconomia.es/volatil.asp?o=1513041124 https://www.heconomia.es/volatil.asp?o=1625032559 https://www.heconomia.es/volatil.asp?o=-1348189962 https://www.heconomia.es/volatil.asp?o=1799106153 https://www.heconomia.es/volatil.asp?o=1888223065 https://www.heconomia.es/volatil.asp?o=1678194698 https://heconomia.es/volatil.asp?o=1680517915 https://www.heconomia.es/volatil.asp?o=1680517915 https://heconomia.es/volatil.asp?o=1437850236

I hope this helps!

By the way, this is the logic we use to Append to a new StringBuilder instead of removing from the existing one, I think it is equivalent to yours, but take a look:

      var stringBuilder = new StringBuilder();

      while (index < text.Length)
      {
        var c = text[index];
        // carriage return and line feed are not separator characters
        bool isSpace = char.IsSeparator(c);
        bool isNewline = c is '\r' or '\n';
        if (isSpace)
          c = ' ';
        else if (isNewline)
          c = '\n';

        if (!(previousNewline && isSpace) && !(previousSpace && isSpace) && !(isSpace && text[index + 1] is '\r' or '\n'))
          stringBuilder.Append(c);

        index++;

        previousSpace = isSpace;
        previousNewline = isNewline;
      }

Apr 28 '24 20:04 malv007

Thanks for providing the additional context. I implemented your suggestions. I am not sure what is causing the problems, since from what I understand StringWriter is just a StringBuilder with some additional methods.

Your logic is mostly equivalent, but it normalizes newlines to \n, which is not what is happening right now. To be fair, testing the changes, I think that internally AngleSharp (the HTML parsing library we use) does already do that, so that would not be a big deal. However, it is slightly different from our current code.

Apr 29 '24 16:04 gabriele-tomassetti

SmartReader SmartReader copied to clipboard

ConvertToPlaintext performance enhancements

SmartReader
SmartReader copied to clipboard