turndown Allow for preserving non-semantic whitespace and softbreaks

Problem

Using the library I noted that the content of the markdown is not dealing properly with multiples whitespaces.

For example

<a href="http://www.example.com">link with     spaces</a>

converts to

[link with spaces](http://www.example.com)

not respecting the content of the html element.

Replicable example

https://codesandbox.io/s/markdown-to-jsxturndown-1yrfn?file=/src/index.js

I've used markdown-to-jsx to convert from markdown to html and back using turndown but the result is not always the same. In the example, I've used the css rule white-space: pre-wrap; that helps React to preserve multiple whitespaces with jsx.

Possible solution

Using an option as

collapseMultipleWhitespaces: false

that allow the library to disable the collapsing of series of multiple whitespaces, and eventually reduce to one whitespace for leading and trailing.

See related PR #362

Jan 22 '21 06:01 eliascotto

Using the library I noted that the content of the markdown is not dealing properly with multiples whitespaces.

Turndown just follows the common HTML whitespace interpretation. Why do you think it is improper?

The thing is that whitespace collapsing behaves differently in Markdown and HTML and there is no easy universal way to transfer arbitrary HTML whitespace syntax to Markdown. So we at least try hard to convert it to a semantic equivalent. Will post more details to your PR.

What is your use case for this?

Jan 22 '21 14:01 martincizek

The HTML standard introduce the possibility to avoid multiple whitespace trimming using white-space css rule. So the browser is able to render elements keeping multiple whitespaces on the inner text.

My use case is the ability to render markdown code from an <div contenteditable="true"> to html and to markdown again using turndown. So keeping a 1=1 with the editor.

I need to use turndown not as a markdown library (that render markdown content to html) but as a html to markdown converter. So the conversion, with options, should be able to convert exactly what is rendered inside the HTML, not applying custom rules or interpretation of the content.

Jan 22 '21 20:01 eliascotto

I know this isn't the best solution, but I normalize the target element text nodes whitespaces before getting the markdown, and them back again to plain white spaces

export function getTextNodesIn(el: HTMLElement | Text | Node) {
  const textNodes: Text[] = []
  if (el.nodeType === 3) {
    textNodes.push(el as Text)
  } else {
    const children = el.childNodes
    for (let i in children) {
      textNodes.push(...getTextNodesIn(children[i]))
    }
  }
  return textNodes
}

// @NOTE: Convert white space to &nbsp; before getting markdown, since turndown ignores multiple spaces
function normalizeWhiteSpace(el: HTMLElement) {
  const target = el.cloneNode(true) as HTMLElement
  
  target.normalize()

  const nodes = getTextNodesIn(target)
  for (let i in nodes) {
    if (nodes[i].textContent) nodes[i].textContent = nodes[i].textContent?.replace(/ /g, '\u00a0') || ''
  }

  return target
}

const endSpaceRe = new RegExp('\u00a0', 'g')
const normalizedTarget = normalizeWhiteSpace(target)
let result = service.turndown(normalizedTarget).replace(endSpaceRe, ' ')

Jan 23 '21 13:01 bruno2ms

The HTML standard introduce the possibility to avoid multiple whitespace trimming using white-space css rule.

Indeed. But I came across just single worthwhile use case so far - the already mentioned preformatted inline code (#318, PR #319).

So keeping a 1=1 with the editor.

As I mentioned at your PR, it is not achievable in general. E.g. you can't mirror two consecutive newlines as softbreaks to Markdown, as it will become a paragraph, etc. I can imagine preserving whitespace runs (except newlines, i.e. [^\S\r\n]) that are surrounded by non-whitespace characters ([\S]) from both sides. Please note that Markdown and HTML interprets Unicode whitespace differently, so replacing e.g. runs of [ \t] can have side effects - that's why the \s / \S.

And a related thing, which can be reasonably demanded in relation to this, is somehow preserving softbreaks. That would be much more tricky. :/

The issue is that whitespace collapsing is deep inside the Turndown's core and we should be pretty sure that we don't break anything, even though it is configuration based.

So your placeholder pre/postprocessing for such special use cases[*] is probably a good workaround for now. Some suggestions related to your code:

What does it do when you convert <p>foo</p> using your preprocessor + Turndown and render the resulting MD back to HTML? :-) ... You can use the above regexp suggestions in conjunction with lookahead matching to get safer results.
It's better to use a more unique placeholders, either a random strings or reserved unicode characters (which would preserve text length for e.g. setext headers).
To be on the safe side, it is better to keep the boundary whitespace around placeholders (if possible). This would ensure minimum change in processor semantics.

[*] Don't like when maintainers tell contributors their use case is special. But I wish that the developers who enabled spacebar-formatting in MS Word were never born :D

And seriously - do you really use white-space property in CSS for regular text (i.e. neither code block, nor inline code)? I've also tried the default behaviour of <div editablecontent="true">, typed foo, three normal spaces and bar and it produced foo   bar... This is also the default behavior of WYSIWYG editors.

I've also realized that @domchristie hasn't yet released the latest master, so you can check it out manually - there is the improved Unicode whitespace treatment and the preformattedCode option that does the best possible for inline code spans. If you are interested in the deep background of the topic, there is also this whitespace treatment analysis - with its results already in master...

Jan 23 '21 19:01 martincizek

Okay, got it. To give you a real use case, notion.so created a wysiwyg editor with only <div contenteditable="true" style="white-space:pre-wrap;">. If you copy the content gives you a formatted markdown with multiple whitespaces. They might not directly convert the html but they have this possibility in the editor and in the exported content. I will found a workaround 👍

Jan 24 '21 03:01 eliascotto

@elias94 I actually thought the @bruno2ms code snippet was yours, sorry about confusing the reply. Bruno's approach can be used with the tweaks I've described.

I've reopened the issue, as it is legit. It's more about benefit vs. introduced complexity and priority indeed. Can I see the notion.so in action eventually?

Internal remark: the most tangible reason for legitimity is probably backtranslation preserving. For whitespace and softbreaks that are actually rendered by a CommonMark renderer from pure-Markdown source, it makes sense to have opportunity to convert them back.

Jan 24 '21 20:01 martincizek

Hello @martincizek, first, kudos for Turndown, it's an amazing tool, and I really appreciate your effort in getting things done right without introducing unwarranted bugs. With that said, I'd like to add my case for reviving the feature that @elias94 tried to implement with #362.

We're building a new text editor based on Tiptap, but we need to store Markdown, and this is where Turndown comes into play for us. But we are not only building a rich-text editor, but also an augmented plain-text editor. Before I get into that last part, let me just state that Turndown works perfectly for the rich-text editor mode, it takes the HTML generated by Tiptap, and with a few tweaks to some rules, we get the Markdown output we need.

Now, what's this "augmented plain-text editor" thing? It's still a Tiptap-powered editor but only with minimal extensions to provide a plain-text mode, but it also includes custom extensions tailored to our needs (for instance, user mentions). In other words, this editor only allows for a single root <p> element, multiple <br> elements, text (of course), and outputs specific HTML from our custom extensions; it doesn't allow any other HTML element like <strong>, <ul>/<li>, etc. And, just like we are using Turndown for the rich-text mode, we also need to use it for the plain-text mode to parse the specific HTML from our custom extensions, which are shared between both modes.

As you might've guessed by now, this plain-text editor is meant to be an augmented Markdown-based editor (similar to the GitHub editor we can see right here in the comment box). The problem? A Markdown editor cannot collapse whitespace (think of nested lists).

In a nutshell, Turndown is working great for our rich-text editor, but not so much for our plain-text Markdown editor. The option @elias94 introduced would've worked nicely for us, and I'm here requesting if something like that can be added (maybe as an advanced/hidden option, or some sort of hook where we can tweak the collapse whitespace regex/function?). I understand your concerns about adding it as it was presented in the PR, but I don't think those concerns would apply to our plain-text Markdown editor (please correct me if I'm wrong).

To finish, because this is a dealbreaker for us, I had to introduce a similar workaround to https://github.com/mixmark-io/turndown/issues/361#issuecomment-766083490. I'm passing an HTML string to the turndown function (and not a node like in the comment above), and I do something like this:

// Convert all instances of at least 2 standard space characters to &nbsp;
const htmlInput = input.replace(/ {2,}/g, (m) => m.replace(/ /g, '\u00a0'))

// Convert all &nbsp; characters back to standard spaces
const result turndownService.turndown(htmlInput).replace(/\u00a0/g, ' ')

I haven't tested this extensively yet, but it looks to be working for now. Nonetheless, would love your feedback on this.

Dec 20 '21 19:12 rfgamaral

turndown turndown copied to clipboard

Allow for preserving non-semantic whitespace and softbreaks

Problem

Replicable example

Possible solution

turndown
turndown copied to clipboard