php-markdown Recursion limits hit more often in PHP 7.3+

With the major release of PHP 7.3 there were some big changes under the hood with the regular expression engine, going from PCRE to PCRE2. While it is not explicitly called out in the changes, I noticed that this new PCRE2 engine is a lot more susceptible to the PREG_RECURSION_LIMIT_ERROR error when using MarkdownExtra. Let's dig in to the details.

pcre.recursion_limit is the INI setting that controls:

(Default 100000) PCRE's recursion limit. Please note that if you set this value to a high number you may consume all the available process stack and eventually crash PHP (due to reaching the stack size limit imposed by the Operating System).

In PHP 7.2 and below, we had actually decreased the 100,000 default to 10,000 for performance reasons (to stop out-of-control user content from slowing down PHP). But in PHP 7.3, having this 10,000 number now fails on pieces of content that did not fail before. This led me to think that PHP 7.3 is more susceptible to this.

What happens when the recursion limit is hit?

Simple, the preg_split call returns false, which this library doesn't expect, so the code fails at a later line due to unexpected data type (false is not Countable, also a fairly recent & strict PHP error).

Example Content

But what specifically is in the content that causes us to hit this limit? It turns out that it has to do with dangling < in the markdown -- for example, <3 for a heart symbol, or doing a math comparison like 4 < 8 or really any other < that is not part of an HTML tag. That alone isn't a problem, you also need to have "a lot" of content after the <. I tested truncating the markdown to various lengths and it wouldn't fail until it was sufficiently long enough. For the INI setting of 10,000 this failed at a document length of around 5k chars where the <3 was near the top. In fact I'll attach the file here testpost.txt. to test, run with:

echo "<?php  \Michelf\MarkdownExtra::defaultTransform(file_get_contents('testpost.txt'));" > testing.php;
php -dauto_prepend_file=vendor/autoload.php -dpcre.recursion_limit=10000 testing.php

If I understand the code correctly, it seems that it is trying to find HTML tags by looking for the starting < and then trying to find the ending >. But what if there never is an ending >? That seems to be our problem. I noticed that if I preprocessed the file to replace <3 with <3, the problem went away and we did not hit the recursion limit error.

Solutions

So what can we do here? For starters as a bare minimum, I'd like MarkdownExtra to detect regular expression errors such as this, and throw them as exceptions, so at least we can know what went wrong instead of the code failing at some arbitrary spot. I have an example prototype here demonstrating this which I could open as a PR. I noticed that this library does not throw exceptions at all, so this would be a first; I'm not sure if there is a more preferred way of handling this kind of error, such as maybe returning a blank string, or maybe returning the original text unmodified.

In addition to a change like that (or instead of), I could simply just increase the recursion limit back to PHP's default of 100,000 and observe the performance, as PHP has gotten a lot faster since we originally lowered the setting.

However, I also wonder if some algorithm improvement could be made in Markdown to be able to detect dangling < such as these which are clearly not HTML tags, so it doesn't end up exhausting the recursion depth. I think even with the default of 100,000 people will start running into this issue more with PHP 7.3+ for especially large files.

Would love your thoughts on this! Thanks.

Oct 27 '20 16:10 michaelbutler

I think this fell off my radar and I forgot to reply. Sorry.

I think your analysis is quite good.

Throwing seems like a good idea. I'd suggest we do this by adding a private preg_split wrapper instead of duplicating error handling code everywhere. Maybe this should to be gated behind a configuration variable to avoid a sudden change in behavior, or maybe it should be the reverse, I'm not sure.

Beside that, the solution is to fix the problematic regex so it does not recurse that much. At a quick glance, I find this line a bit suspect:

.+?				# Anything but quotes and `>`.

Maybe we could write it to be more strict and not depend on the surrounding expression as:

[^'">]+?				# Anything but quotes and `>`.

There might be other, but this one sounds like it could be the culprit. I'll have to investigate a bit more.

Thank you for the detailed report.

Dec 15 '20 12:12 michelf

Interestingly I can no longer reproduce the issue if I enable JIT:

php -dpcre.jit=1 -dauto_prepend_file=vendor/autoload.php -dpcre.recursion_limit=10 testing.php

Compare with:

$ php -dpcre.jit=0 -dauto_prepend_file=vendor/autoload.php -dpcre.recursion_limit=10 testing.php
PHP Notice:  Trying to access array offset on value of type bool in Michelf/MarkdownExtra.php on line 499
PHP Warning:  count(): Parameter must be an array or an object that implements Countable in Michelf/MarkdownExtra.php on line 502

Feb 18 '22 15:02 sanmai

I found that converter also breaks down on long unquoted attributes such as <img src=data:image/png;base64,iVBO.....>

Quoting the attribute such as in <img src="data:image/png;base64,iVBO....."> fixes the problem.

Feb 22 '22 04:02 sanmai