turndown Handle newlines inside strong/emphasis tags

Previously the following markup:

<em>Hello <br><br> world</em>

would get converted into the markdown:

_hello

world_

which is incorrect. Both lines should be surrounded with underscores. This pull request splits em, i,strong, andb` tags into lines and surrounds each line with the appropriate markdown

Nov 10 '15 16:11 a-warner

Thanks for this!

It’s an interesting case because I don’t think it’s possible to produce Hello world from Markdown, so it raises questions as to what the correct outcome is?! :/

Your proposed outcome is:

_Hello_
_world_

which converts back to HTML as:

<p><em>Hello</em>
<em>world</em></p>

which renders as:

Hello world

To make this a bit more robust, perhaps toMarkdown should place any number of newlines with a space in any inline element before converting it?

Nov 10 '15 22:11 domchristie

yea actually my proposed outcome for Hello world converted to markdown is actually:

_Hello_

_world_

which, translated back to html, is:

HelloWorld

which I think is right, but that's not what the old version of my code actually did. The new version I think handles this case properly.

Thoughts?

Nov 11 '15 01:11 a-warner

Thanks! I think the result of your most recent version is more inline with the outcome of the rendered HTML, however I am not entirely convinced that a converter is the correct place to handle this case.

Converters should be as simple as possible, and should contain as little logic as possible. This is unavoidable in some cases (e.g. ul/ol), but it feels like the em/strong converter is doing too much.

In this case I’m favouring stripping multiple newlines from inline elements before they are converted (i.e. somewhere in the process function). It’d mean that some accuracy may be lost in this case, but it keeps the complexity down to a minimum.

Hope that makes sense?!

Nov 11 '15 20:11 domchristie

In this case I’m favouring stripping multiple newlines from inline elements before they are converted (i.e. somewhere in the process function). It’d mean that some accuracy may be lost in this case, but it keeps the complexity down to a minimum.

So you're saying that:

foo bar should become _foo bar_? (i.e. with no newlines in there at all?)

Nov 11 '15 22:11 a-warner

Perhaps

<em>foo<br><br>bar</em>

should convert to

_foo  
bar_

which would then result in the following when converted back to HTML:

<p><em>foo <br />
bar</em></p>

This could be achieved by doing something like the following in the process function:

// Remove line-breaks caused by multiple <br>
if(isInline(node)) content = content.replace(/(  \n){2,}/g, '  \n')

Obviously it’s not perfect, but considering that it’s not possible to create the original HTML from markdown, and that it fixes the current issue (of the content not being wrapped correctly), as well as maintaining the simplicity of the converters, I feel it’s a reasonably good solution.

Has this HTML been generated by contentEdtitable by any chance? If so, you may want to check out https://github.com/guardian/scribe which normalises some of the contentEditable weirdness.

Nov 12 '15 09:11 domchristie

@domchristie as backstory, I came across this issue while trying to convert HTML generated by trix editor to markdown. In trix, if you turn on italics and hit return twice, it'll generate this markup. It's perfectly valid html, which I agree cannot reasonably be generated by markdown.

Is the goal of this library to only convert HTML that could be generated by markdown into markdown? If so then I think I agree your solution is the best. If the goal is to do a best-effort two-way conversion between the two then I think collapsing multiple newlines into one newline isn't great.

Either way I agree that the processor should probably handle this instead of the converter, good call! What if I change the processor to call inline converters once per line? Another option would be to convert   to a paragraph break, though that might be more difficult.

Nov 12 '15 16:11 a-warner

Is the goal of this library to only convert HTML that could be generated by markdown into markdown? If so then I think I agree your solution is the best. If the goal is to do a best-effort two-way conversion between the two then I think collapsing multiple newlines into one newline isn't great.

Yeh, it’s an interesting point. The goal of this library isn’t to only convert HTML that has been generated from Markdown, but to take a best-effort approach. However that’s not to say that it’s responsibility is to handle all cases including invalid or poorly authored HTML.

About , the spec says:

br elements must be used only for line breaks that are actually part of the content, as in poems or addresses.

So to have two successive  s seems rather odd, even though it is perfectly valid.

In this case of Hello world, knowing it comes from a WYSIWYG, the intention is probably to have two paragraphs: one containing Hello and the other with world, but this may not be true in all cases. If it is the case then it should probably be fixed before passing it in to to-markdown, since it’d be impossible for to-markdown to determine the intent of every sequence of markup of this kind.

The options:

Original

<p>
  <em>Hello <br><br> world</em>
</p>

Option 1.

More than one line-break is squashed to just one. Results in:

_Hello  
world_

Converted back to HTML:

<p>
  <em>Hello <br>world</em>
</p>

Option 2.

Two paragraphs are created:

_Hello_

_world_

<p>
  <em>Hello</em>
</p>

<p>
  <em>world</em>
</p>

To me they’re both incorrect! Option 1 maintains more of the original markup (since the  s may be deliberate) but is still lossy, whereas Option 2 probably spaces the text appropriately but loses the  s and adds another paragraph which alters the meaning.

So in cases like this, I’d opt for the simplest, most maintainable solution.

As I mentioned above, I don’t think it’s the responsibility of the library to fix/alter the HTML structure, but the idea of converting line-by-line sounds interesting!

Aside: Interestingly The Guardian’s Scribe creates a new paragraph on return. It’s possible to create line-breaks but harder to do so (shift+return) http://guardian.github.io/scribe/

Nov 12 '15 18:11 domchristie

I have contributed some thoughts to the issue on trix here: https://github.com/basecamp/trix/issues/75#issuecomment-156387925

Nov 13 '15 10:11 domchristie

Any news on if this will be implemented?

I have a similar issue where I'm converting html from emails to markdown. I don't have control over the markup of the emails coming through.

This markup

<div dir="ltr"><b>Here&#39;s a link:<br><br></b><a href="https://www.google.com/">https://www.google.com/</a><div><br></div><div><br></div></div>

gets converted to

**Here's a link:    **[https://www.google.com/](https://www.google.com/)

Feb 25 '20 22:02 dylanjmcdonald

turndown turndown copied to clipboard

Handle newlines inside strong/emphasis tags

Original

Option 1.

Option 2.

turndown
turndown copied to clipboard