marky-markdown
Contents of HTML-wrapped code blocks are parsed
GitHub's Markdown parser doesn't seem to parse backtick-delimited text, even when it is placed inside HTML. The following Markdown:
Some content
<div>`<style>`</div>
More content - invisible.
renders just fine on GitHub, while marky-markdown cuts off everything after the <style> part. This broke the README of Webpack's style-loader; see https://www.npmjs.com/package/style-loader and https://github.com/webpack-contrib/style-loader/pull/227.
Interesting case! Thanks for reporting 👍
What's happening here is that everything in the HTML block is being left alone. Since the block is basically malformed HTML, the second backtick gets interpreted as style content inside the <style> tag. The HTML sanitizer then strips out <style> elements, so you end up with only <div>`</div> as HTML output.
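To make the mechanism concrete, here is a toy model of it. This is not marky-markdown's actual sanitizer: `stripStyleElements` is a hypothetical helper that only imitates the two steps described above, namely an unclosed <style> swallowing the rest of the input as raw text, followed by removal of the whole element.

```javascript
// Toy model only; a real HTML parser and marky-markdown's sanitizer are
// more involved. `stripStyleElements` is a hypothetical name.
function stripStyleElements(html) {
  // Everything after an opening <style> tag is treated as raw stylesheet
  // text until </style> appears. With no closing tag, that text runs to
  // the end of the input, so the whole remainder is removed.
  return html.replace(/<style\b[^>]*>[\s\S]*?(?:<\/style\s*>|$)/gi, '');
}

const htmlBlock = '<div>`<style>`</div>\nMore content - invisible.';

console.log(JSON.stringify(stripStyleElements(htmlBlock)));
// prints "<div>`"
```

A real parser would also close the dangling <div> automatically, which is presumably where the trailing </div> in the actual output comes from.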
I verified this in the live marky tester by changing <style> to <br>, <hr>, etc… Those tags are allowed by the sanitizer, so everything renders as HTML. If I turn the sanitize option off, the <style> makes it through OK, but since the < and > aren't escaped, the browser renders the tag as being invisible, like a normal <style> tag. Whew!
GitHub's behavior here is kind of fascinating. At least in a gist, <br> and <hr> inside backticks render the same way marky does: the tags just get interpreted as normal HTML. <style> renders as you saw above, but <body> just disappears, <head> disappears, <blargh> disappears… they just render as consecutive backticks. In fact, we can even eliminate the backticks, and GitHub renders <style> as though it were escaped properly. Looks like the backticks are a red herring? 😳
At first blush, it almost looks like <style> is some kind of special case. I'm going to have to do a more exhaustive test to see which tags behave similarly, and we'll have to match behavior.
I created a demonstration gist showing the rendering results of embedding different HTML tags. Looks like we need to special-case the following tags:
<iframe>, <script>, <style>, <textarea>, <title>
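A minimal sketch of what special-casing these tags could look like, assuming the goal is to show them as literal text the way GitHub appears to. `escapeSpecialTags` and `ESCAPED_TAGS` are hypothetical names, not part of marky-markdown, and the regex is deliberately naive (it does not handle a > inside an attribute value).

```javascript
// Tag names taken from the list in this thread.
const ESCAPED_TAGS = ['iframe', 'script', 'style', 'textarea', 'title'];

// Matches an opening or closing tag for any of the names above, e.g.
// "<style>", "</style>", or "<iframe src=...>". Case-insensitive.
// The lookahead keeps "<styles>" or "<titlebar>" from matching.
const SPECIAL_TAG = new RegExp(
  '<(/?(?:' + ESCAPED_TAGS.join('|') + ')(?=[\\s/>])[^>]*)>',
  'gi'
);

// Hypothetical helper: escape the angle brackets of matching tags so the
// browser displays them as text instead of interpreting them.
function escapeSpecialTags(html) {
  return html.replace(SPECIAL_TAG, '&lt;$1&gt;');
}

console.log(escapeSpecialTags('<div>`<style>`</div>'));
// prints: <div>`&lt;style&gt;`</div>
```

Other tags, such as the surrounding <div>, pass through unchanged, so the sanitizer would still see and handle them as before.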
Great analysis! You're right, the backticks were a red herring: I stripped them all from your gist and the results are the same.
Looks like most of these are fairly straightforward to implement, but there are a couple things to note:
- Our tests explicitly check that <script> is allowed when we're executed with sanitize: false. Nothing magic about <script> in particular; it's just an example of something that would normally be stripped out by the sanitizer. So I'm thinking maybe since turning the sanitizer off is, in a way, opting out of 100% strict GH compat, what about skipping this HTML escaping process in the case of sanitize: false?
- The sanitizer is configured to strip iframes unless they're pointing to youtube URLs. IIRC we still need that capability because the npm docs have embedded YT vids. Is that still the case? It looks like GH compat is to always escape <iframe> tags in HTML blocks no matter what the src points to. Should we implement the GH version, and allow for the {YT-only, unescaped} version via some combination of options?
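One possible shape for that combination of options, as a rough sketch only. `sanitize` is the existing option mentioned above; `allowYouTubeIframes`, `shouldEscape`, and `isYouTubeEmbed` are hypothetical names made up for illustration, and the URL check is an assumption rather than the sanitizer's real rule.

```javascript
const SPECIAL_TAGS = ['iframe', 'script', 'style', 'textarea', 'title'];

// Loose check for a YouTube embed URL (assumption; the real sanitizer's
// rule may differ).
function isYouTubeEmbed(src) {
  return /^(?:https?:)?\/\/(?:www\.)?youtube\.com\/embed\//.test(src || '');
}

// Decide whether a tag found in an HTML block should be escaped.
function shouldEscape(tagName, src, options) {
  const name = tagName.toLowerCase();
  if (!SPECIAL_TAGS.includes(name)) return false;

  // Turning the sanitizer off opts out of strict GitHub compatibility,
  // so the escaping step is skipped entirely.
  if (options.sanitize === false) return false;

  // Hypothetical opt-in: leave YouTube iframes unescaped for the npm docs.
  if (name === 'iframe' && options.allowYouTubeIframes && isYouTubeEmbed(src)) {
    return false;
  }

  // Default: match GitHub and escape.
  return true;
}

console.log(shouldEscape('iframe', 'https://www.youtube.com/embed/abc', { sanitize: true }));
// prints: true
```

With this layout the GitHub-compatible behavior is the default, and both exceptions have to be requested explicitly.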
@ashleygwilliams any thoughts?