turndown icon indicating copy to clipboard operation
turndown copied to clipboard

Handle newlines inside strong/emphasis tags

Open a-warner opened this issue 8 years ago • 9 comments

Previously the following markup:

<em>Hello <br><br> world</em>

would get converted into the markdown:

_hello

world_

which is incorrect. Both lines should be surrounded with underscores. This pull request splits em, i,strong, andb` tags into lines and surrounds each line with the appropriate markdown

a-warner avatar Nov 10 '15 16:11 a-warner

Thanks for this!

It’s an interesting case because I don’t think it’s possible to produce <em>Hello <br><br> world</em> from Markdown, so it raises questions as to what the correct outcome is?! :/

Your proposed outcome is:

_Hello_
_world_

which converts back to HTML as:

<p><em>Hello</em>
<em>world</em></p>

which renders as:

Hello world

To make this a bit more robust, perhaps toMarkdown should place any number of newlines with a space in any inline element before converting it?

domchristie avatar Nov 10 '15 22:11 domchristie

yea actually my proposed outcome for <em>Hello <br><br> world</em> converted to markdown is actually:

_Hello_

_world_

which, translated back to html, is:

<p><em>Hello</em></p><p><em>World</em></p>

which I think is right, but that's not what the old version of my code actually did. The new version I think handles this case properly.

Thoughts?

a-warner avatar Nov 11 '15 01:11 a-warner

Thanks! I think the result of your most recent version is more inline with the outcome of the rendered HTML, however I am not entirely convinced that a converter is the correct place to handle this case.

Converters should be as simple as possible, and should contain as little logic as possible. This is unavoidable in some cases (e.g. ul/ol), but it feels like the em/strong converter is doing too much.

In this case I’m favouring stripping multiple newlines from inline elements before they are converted (i.e. somewhere in the process function). It’d mean that some accuracy may be lost in this case, but it keeps the complexity down to a minimum.

Hope that makes sense?!

domchristie avatar Nov 11 '15 20:11 domchristie

In this case I’m favouring stripping multiple newlines from inline elements before they are converted (i.e. somewhere in the process function). It’d mean that some accuracy may be lost in this case, but it keeps the complexity down to a minimum.

So you're saying that:

<em>foo<br><br>bar</em> should become _foo bar_? (i.e. with no newlines in there at all?)

a-warner avatar Nov 11 '15 22:11 a-warner

Perhaps

<em>foo<br><br>bar</em>

should convert to

_foo  
bar_

which would then result in the following when converted back to HTML:

<p><em>foo <br />
bar</em></p>

This could be achieved by doing something like the following in the process function:

// Remove line-breaks caused by multiple <br>
if(isInline(node)) content = content.replace(/(  \n){2,}/g, '  \n')

Obviously it’s not perfect, but considering that it’s not possible to create the original HTML from markdown, and that it fixes the current issue (of the content not being wrapped correctly), as well as maintaining the simplicity of the converters, I feel it’s a reasonably good solution.

Has this HTML been generated by contentEdtitable by any chance? If so, you may want to check out https://github.com/guardian/scribe which normalises some of the contentEditable weirdness.

domchristie avatar Nov 12 '15 09:11 domchristie

@domchristie as backstory, I came across this issue while trying to convert HTML generated by trix editor to markdown. In trix, if you turn on italics and hit return twice, it'll generate this markup. It's perfectly valid html, which I agree cannot reasonably be generated by markdown.

Is the goal of this library to only convert HTML that could be generated by markdown into markdown? If so then I think I agree your solution is the best. If the goal is to do a best-effort two-way conversion between the two then I think collapsing multiple newlines into one newline isn't great.

Either way I agree that the processor should probably handle this instead of the converter, good call! What if I change the processor to call inline converters once per line? Another option would be to convert <br><br> to a paragraph break, though that might be more difficult.

a-warner avatar Nov 12 '15 16:11 a-warner

Is the goal of this library to only convert HTML that could be generated by markdown into markdown? If so then I think I agree your solution is the best. If the goal is to do a best-effort two-way conversion between the two then I think collapsing multiple newlines into one newline isn't great.

Yeh, it’s an interesting point. The goal of this library isn’t to only convert HTML that has been generated from Markdown, but to take a best-effort approach. However that’s not to say that it’s responsibility is to handle all cases including invalid or poorly authored HTML.

About<br>, the spec says:

br elements must be used only for line breaks that are actually part of the content, as in poems or addresses.

So to have two successive <br>s seems rather odd, even though it is perfectly valid.

In this case of <em>Hello <br><br> world</em>, knowing it comes from a WYSIWYG, the intention is probably to have two paragraphs: one containing <em>Hello</em> and the other with <em>world</em>, but this may not be true in all cases. If it is the case then it should probably be fixed before passing it in to to-markdown, since it’d be impossible for to-markdown to determine the intent of every sequence of markup of this kind.

The options:

Original

<p>
  <em>Hello <br><br> world</em>
</p>

Option 1.

More than one line-break is squashed to just one. Results in:

_Hello  
world_

Converted back to HTML:

<p>
  <em>Hello <br>world</em>
</p>

Option 2.

Two paragraphs are created:

_Hello_

_world_
<p>
  <em>Hello</em>
</p>

<p>
  <em>world</em>
</p>

To me they’re both incorrect! Option 1 maintains more of the original markup (since the <br>s may be deliberate) but is still lossy, whereas Option 2 probably spaces the text appropriately but loses the <br>s and adds another paragraph which alters the meaning.

So in cases like this, I’d opt for the simplest, most maintainable solution.

As I mentioned above, I don’t think it’s the responsibility of the library to fix/alter the HTML structure, but the idea of converting line-by-line sounds interesting!

Aside: Interestingly The Guardian’s Scribe creates a new paragraph on return. It’s possible to create line-breaks but harder to do so (shift+return) http://guardian.github.io/scribe/

domchristie avatar Nov 12 '15 18:11 domchristie

I have contributed some thoughts to the issue on trix here: https://github.com/basecamp/trix/issues/75#issuecomment-156387925

domchristie avatar Nov 13 '15 10:11 domchristie

Any news on if this will be implemented?

I have a similar issue where I'm converting html from emails to markdown. I don't have control over the markup of the emails coming through.

This markup

<div dir="ltr"><b>Here&#39;s a link:<br><br></b><a href="https://www.google.com/">https://www.google.com/</a><div><br></div><div><br></div></div>

gets converted to

**Here's a link:    **[https://www.google.com/](https://www.google.com/)

dylanjmcdonald avatar Feb 25 '20 22:02 dylanjmcdonald