marky-markdown HTML-wrapped code blocks contents are parsed

GitHub's Markdown parser doesn't seem to parse backticks-contained text even when put inside of HTML; the following Markdown:

Some content

<div>`<style>`</div>

More content - invisible.

will render just fine while marky-markdown cuts everything off after the <style> part. This broke readme of Webpack's style-loader; see https://www.npmjs.com/package/style-loader & https://github.com/webpack-contrib/style-loader/pull/227.

May 05 '17 14:05 mgol

Interesting case! Thanks for reporting 👍

What's happening here is, everything in the HTML block is being left alone, and since the block is basically malformed HTML, the second backtick gets interpreted as being style content inside the <style> tag, and then the HTML santizer is stripping out <style> elements, so you only end up with <div>`</div> HTML output.

I verified this in the live marky tester by changing <style> to <br>, <hr>, etc… Those tags are allowed by the sanitizer, so everything renders as HTML. If I turn the sanitize option off, the <style> makes it through OK, but since the < and > aren't escaped, the browser renders the tag as being invisible, like a normal <style> tag. Whew!

GitHub's behavior here is kind of fascinating. At least in a gist, <br> and <hr> inside backticks render the same way marky does—the tags just get interpreted as normal HTML. <style> renders as you saw above, but <body> just disappears, <head> disappears, <blargh> disappears…like they just render as consecutive backticks. In fact, we can even eliminate the backticks, and GitHub renders<style> as though it was escaped properly. Looks like the backticks are a red herring? 😳

At first blush, it almost looks like <style> is some kind of special case. I'm going to have to do a more exhaustive test to see which tags behave similarly, and we'll have to match behavior.

May 05 '17 15:05 revin

I created a demonstration gist showing the rendering results of embedding different HTML tags. Looks like we need to special-case the following tags:

<iframe>
<script>
<style>
<textarea>
<title>

May 05 '17 17:05 revin

Great analysis! You're right backticks were a red herring, I stripped them all from your gist and results are the same.

May 06 '17 18:05 mgol

Looks like most of these are fairly straightforward to implement, but there are a couple things to note:

Our tests explicitly check that <script> is allowed when we're executed with sanitize: false. Nothing magic about <script> in particular; it's just an example of something that would be normally stripped out by the sanitizer. So I'm thinking maybe since turning the sanitizer off is, in a way, opting out of 100% strict GH compat, what about skipping this HTML escaping process in the case of sanitize: false?
The sanitizer is configured to strip iframes unless they're pointing to youtube URLs. IIRC we still need that capability because the npm docs have embedded YT vids. Is that still the case? It looks like GH compat is to always escape <iframe> tags in HTML blocks no matter what the src points to. Should we implement the GH version, and allow for the {YT-only, unescaped} version via some combination of options?

@ashleygwilliams any thoughts?

May 15 '17 19:05 revin

marky-markdown marky-markdown copied to clipboard

HTML-wrapped code blocks contents are parsed

marky-markdown
marky-markdown copied to clipboard