parsedown
parsedown copied to clipboard
Parsedown does not treat emphasis tags (*, _, etc.) as literal when spaced out
Daring Fireball's documentation states:
Markdown treats asterisks (*) and underscores (_) as indicators of emphasis. (...) But if you surround an * or _ with spaces, it’ll be treated as a literal asterisk or underscore.
( https://daringfireball.net/projects/markdown/syntax#em )
So, in most Markdown parsers (including GitHub): *This is italic*
produces: This is italic
and: * This is not italic *
produces: * This is not italic *
However, the current version of Parsedown treats both as italic. This is confirmed with the current version of the http://parsedown.org/demo site, which shows Markdown PHP working as expected, and Parsedown tripping over.
Happy birthday, bug! Any further movement on this getting resolved, as it's been sitting as a bug within my application for a year now.
I can replicate this on release 1.7.1 and directly on master - but not on the demo site http://parsedown.org/demo. There it has the same behaviour as Markdown PHP 1.3 and the results of Guthub preview. Any ideas on what version is being used on the demo site?
Looks like this can be fixed by amending the $EmRegex
to exclude matches that start/end with whitespace.
The current regex is
protected $EmRegex = array(
'*' => '/^[*]((?:\\\\\*|[^*]|[*][*][^*]+?[*][*])+?)[*](?![*])/s',
'_' => '/^_((?:\\\\_|[^_]|__[^_]*__)+?)_(?!_)\b/us',
);
changing this to
protected $EmRegex = array(
'*' => '/^[*]([^\s](?:\\\\\*|[^*]|[*][*][^*]+?[*][*])+?[^\s])[*](?![*])/s',
'_' => '/^_([^\s](?:\\\\_|[^_]|__[^_]*__)+?[^\s])_(?!_)\b/us',
);
sorts it. Similarly, the $StrongRegex
ought to be updated too, currently
protected $StrongRegex = array(
'*' => '/^[*]{2}((?:\\\\\*|[^*]|[*][^*]*+[*])+?)[*]{2}(?![*])/s',
'_' => '/^__((?:\\\\_|[^_]|_[^_]*+_)+?)__(?!_)/us',
);
and following the same pattern as above this becomes
protected $StrongRegex = array(
'*' => '/^[*]([^\s](?:\\\\\*|[^*]|[*][*][^*]+?[*][*])+?[^\s])[*](?![*])/s',
'_' => '/^__(([^\s]?:\\\\_|[^_]|_[^_]*+_)+?[^\s])__(?!_)/us',
);
Testing this with
$parsedown = new Parsedown();
echo $parsedown->line('*italic* * not italic* *also not italic * *italic again*');
echo '</br>';
echo $parsedown->line('_italic_ _ not italic_ _also not italic _ _italic again_');
previously gave italic not italic also not italic italic again italic not italic also not italic italic again but now gives italic * not italic* *also not italic * italic again italic _ not italic_ _also not italic _ italic again gives the correct output now. @erusev if you agree with the above I can make a PR for it - or am I off the mark here?
I was actually just in the process of exploring a fix around moving more towards the CommonMark spec for defining the opening and closing delimiter runs: since this'll help avoid some other discrepancies. I'm hoping that it might be possible to do so in a way that doesn't require leaning on intersection rules since resolving precedence rules about intersecting inlines is currently out of scope (see https://github.com/erusev/parsedown/issues/588#issuecomment-377221179). If we're pulling apart the emphasis parsing, then it'll also be an opportunity to address some long-standing more widespread escaping issues (#434)—if we can tackle emphasis then it should be fairly easy to address the simpler inlines.
If I can't make anything nice work around the CommonMark definitions, then the solution you've outlined above would also seem address the issue raised here fairly directly. (Note that it doesn't tackle part (b) in the CommonMark definition's of left/right flanking though: https://spec.commonmark.org/0.28/#left-flanking-delimiter-run, IMO we should prefer a solution that incorporates the CommonMark definition if possible).
alec-w's response works well, but there is a minor mistake in the strong regex - I believe it should be:
protected $StrongRegex = array(
'*' => '/^[*]{2}([^\s](?:\\\\\*|[^*]|[*][*][^*]+?[*][*])+?[^\s])[*]{2}(?![*])/s',
'_' => '/^__(([^\s]?:\\\\_|[^_]|_[^_]*+_)+?[^\s])__(?!_)/us',
);
Hi guys!
Any updates related to this issue?
Thanks!
Quick update, turns out alec-w's fix breaks if you have just two characters between your tags.
i.e.: this _is_ italic
will not work correctly, nor will this **is** bold
.
Figured out a correction that seems to work without breaking with only two or one character between your emphasis tags:
protected $StrongRegex = array(
'*' => '/^[*]{2}(?:([^\*\s]|[^\W](?:\\\\\*|[^*]|[*][^*]*+[*])*?[^\W]))[*]{2}(?![*])/s',
'_' => '/^[_]{2}([^\W_]|(?:[^\W](?:\\\\_|[^_]|_[^_]*+_)*?[^\s]))[_]{2}(?!_)/s',
);
protected $EmRegex = array(
'*' => '/^[*](?:([^\*\s]|[^\W](?:\\\\\*|[^*]|[*][^*]*+[*])*?[^\W]))[*](?![*])/s',
'_' => '/^[_]([^\W_]|(?:[^\W_](?:\\\\_|[^_]|_[^_]*+_)*?[^\s_]))[_](?!_)/s',
);
Can't we fix this with a lookaround?
Here's how it'd look for $EmRegex['_']
:
- /^_((?:\\\\_|[^_]|__[^_]*__)+?)_(?!_)\b/us
+ /^_(?!\s)((?:\\\\_|[^_]|__[^_]*__)+?)(?<!\s)_(?!_)\b/us
Basically, the modification adds (?!\s)
after the opening _
and a (?<!\s)
before the closing _
Tests seem to pass.
Here's a quick demo: https://regex101.com/r/cBxRKo/1
@erusev - that's definitely a good solution. Whilst it requires more steps to complete on short phrases (e.g. _hello_
) it definitely is a significantly superior regex in terms of completion steps for longer sentences. Thanks!
Tests seem to pass.
Here's a quick demo: https://regex101.com/r/cBxRKo/1
Looks like the testing may be lacking some nuance - just realised today that a bug I found with ***Test***
(which should be bold-italic) renders as Bold with an asterisk on each side when using this change.