
Merging repeated elements without any non-whitespace content between them

Open zumoshi opened this issue 5 years ago • 9 comments

many WYSIWYG HTML editors leave a lot of artifacts behind, repeated (spammed) elements being one example. currently, turndown converts the following:

<b>a</b> <b>b</b>

to

**a** **b**

which is not wrong, but **a b** would've been preferred. my real inputs are not this short, and they are spammed with repeated tags that make the output less readable.

a similar issue is about general handling of whitespace. for example:

<strong>
  a
  <br>
</strong>
b

is converted to:

**a  
**b

while the ideal output would've been:

**a**
b

I'm not sure how complex it would be to make these changes, but generally ignoring whitespace between multiple tags of the same kind, and pushing whitespace from beginning and end to the outside of tags would increase the quality of output for my main use case a lot.

thanks.
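A minimal sketch of the merge described above, assuming a string-level pass that runs before the HTML is handed to turndown. `mergeRepeatedTags` and its default tag list are hypothetical names, not part of turndown, and the regex only handles attribute-free tags, so treat it as an illustration rather than a robust HTML transform:

```javascript
// Hypothetical pre-processing helper; not part of turndown's API.
// Collapses "</tag> <tag>" pairs when only whitespace sits between them,
// keeping that whitespace as part of the merged element's content.
function mergeRepeatedTags(html, tags = ['b', 'strong', 'i', 'em']) {
  let result = html;
  for (const tag of tags) {
    // Closing tag, optional whitespace, then an attribute-free opening tag.
    const boundary = new RegExp(`</${tag}>(\\s*)<${tag}>`, 'gi');
    result = result.replace(boundary, '$1');
  }
  return result;
}

console.log(mergeRepeatedTags('<b>a</b> <b>b</b>')); // <b>a b</b>
```

Elements separated by any non-whitespace content are left untouched, which matches the "without any non-whitespace content between them" condition in the title.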

zumoshi avatar Jan 12 '19 13:01 zumoshi

Hi @zumoshi, yes this is a bit of a tricky one, and has been discussed in https://github.com/domchristie/turndown/pull/123 in particular in this comment: https://github.com/domchristie/turndown/pull/123#issuecomment-156198157

domchristie avatar Jan 21 '19 21:01 domchristie

perhaps I should've made two separate issues, one for the newline handling and one for merging tags. I don't think it's the same issue though. the discussion you linked to had problems figuring out the ideal markdown output since no such output existed that would've recreated the input HTML.

ignoring the first part, for now, I think the example I provided is actually a bug. turndown's demo converts the following:

<strong>
  a
  <br>
</strong>
b

into:

**a  
**b

and giving that to commonmark's demo will give:

<p>**a<br />
**b</p>

which doesn't involve bolding at all.

not all markdown implementations do this of course, but if you want to be compatible with CommonMark, according to section 6.4, example 347:

This is not emphasis, because the closing * is preceded by whitespace. A newline also counts as whitespace.

same goes for starting delimiter of most tags. you would need to push the whitespace to outside and put delimiters right before/after the first/last non-whitespace character.

(unless I'm mistaken and the default rule-set is not based on commonmark, and I messed up something in the config of the demo to get this non-standard output)
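A rough sketch of the delimiter placement described above, assuming the converted inner content is available as a string. `wrapWithDelimiter` is a hypothetical helper, not turndown's built-in behaviour:

```javascript
// Hypothetical helper; not turndown's built-in behaviour.
// Places the delimiter next to the first/last non-whitespace character
// and leaves any leading/trailing whitespace outside of it.
function wrapWithDelimiter(content, delimiter) {
  const [, leading, core, trailing] = content.match(/^(\s*)([\s\S]*?)(\s*)$/);
  if (core === '') return content; // nothing but whitespace: emit no delimiters
  return leading + delimiter + core + delimiter + trailing;
}

console.log(JSON.stringify(wrapWithDelimiter('a  \n', '**'))); // "**a**  \n"
```

With the whitespace pushed outside, the closing `**` directly follows `a`, so it is no longer preceded by whitespace.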

zumoshi avatar Jan 22 '19 18:01 zumoshi

I don't think it's the same issue though. the discussion you linked to had problems figuring out the ideal markdown output since no such output existed that would've recreated the input HTML.

If I understand correctly, I think it might be related to the other issue, which discusses brs in strong/em elements. Also, as far as I'm aware, it's not possible to produce <strong>a<br></strong>b using Commonmark syntax. So as discussed in #123, the question is, what should the converted output be?

I wonder if the most pragmatic approach would be to not convert inline elements with brs in them?

domchristie avatar Jan 24 '19 09:01 domchristie

according to common mark spec 6.9, there are two ways to generate an output which results in linebreaks for the input I gave:

**a**  
b

(note the spaces after the first line) and

**a**\
b

test in demo: 1, 2

this is related to the issue you mentioned, since using this method double <br/>s is possible as well.

however, my issue is not with how the newlines are handled, rather with the placement of delimiters. if you look at the spec links I gave in my last message, the spec explicitly says there should be no whitespace (including a newline) right after the starting delimiter or right before the ending delimiter. turndown's generated output for that input puts a newline as the last character inside the bold section (i.e. a newline before the closing **), which is not valid emphasis according to commonmark's spec.

zumoshi avatar Jan 24 '19 09:01 zumoshi

however, my issue is not with how the newlines are handled, rather with the placement of delimiters.

In this case, the two are linked. The markdown examples given both result in the following when converted to HTML:

<strong>a</strong><br />
b

… which is not the same as the original:

<strong>a<br /></strong>
b

Turndown handles brs in most cases but it's not possible to generate <strong>a<br></strong>b from commonmark due to the whitespace rules in the spec.

I think there are a few possibilities for solving this issue:

  1. Manipulate the DOM to give the required structure. This could make up part of a preProcess flow, allowing developers to modify the DOM to their liking (perhaps a solution to https://github.com/domchristie/turndown/issues/272 as well as the element spamming issue you mention). I'm reluctant to manipulate the DOM by default, because the chosen markup may be deliberate.
  2. Alter the rules to handle cases like these. This is the solution in https://github.com/domchristie/turndown/pull/123. This somewhat muddies the simplicity of the rules and has similar problems to 1. in that it alters the meaning of the markup.
  3. Do not convert strong/em/code elements containing <br>. I think this is the most pragmatic approach as it leads to accurate, commonmark-compliant output without adding conditionals in the replacements
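
As an illustration of option 3, here is a minimal sketch of a check for "inline element containing a `<br>`". The names `INLINE_TAGS`, `containsBr` and `shouldKeepAsHtml` are hypothetical, the inclusion of `b` and `i` alongside strong/em/code is an assumption, and how such a predicate would be wired into turndown is left open:

```javascript
// Hypothetical predicate; nodes only need `nodeName` and `childNodes`,
// so it works on DOM elements and on the plain objects used below.
const INLINE_TAGS = ['STRONG', 'B', 'EM', 'I', 'CODE'];

// True if any descendant of `node` is a <br>.
function containsBr(node) {
  return Array.from(node.childNodes || []).some(
    (child) => child.nodeName === 'BR' || containsBr(child)
  );
}

// True for inline elements that would be left unconverted under option 3.
function shouldKeepAsHtml(node) {
  return INLINE_TAGS.includes(node.nodeName) && containsBr(node);
}

const strongWithBr = {
  nodeName: 'STRONG',
  childNodes: [{ nodeName: '#text' }, { nodeName: 'BR' }],
};
console.log(shouldKeepAsHtml(strongWithBr)); // true
```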

domchristie avatar Jan 25 '19 09:01 domchristie

… which is not the same as the original:

I would argue they are. Whitespace characters can't be bold. So does it really make any difference if a newline or space is inside or outside a strong or em tag? The two HTML snippets you provided look identical.

Right now the output doesn't result in a strong tag at all, I would prefer the br being outside, and the rendered html looking correct despite positioning of <br> tag not matching original html, compared to not getting any results at all because the WYSIWYG editor used to create the original document decided it would be a good idea to put the br inside a strong tag.

zumoshi avatar Jan 25 '19 09:01 zumoshi

I would prefer the br being outside, and the rendered html looking correct despite positioning of <br> tag not matching original html, compared to not getting any results at all because the WYSIWYG editor used to create the original document decided it would be a good idea to put the br inside a strong tag.

I don't think it is Turndown's responsibility to fix up poorly generated HTML. You may wish to parse the HTML string yourself, manipulate the HTML to your required structure then pass in the DOM tree to turndown. Alternatively you could override the strong rule.
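
A minimal sketch of that kind of pre-processing, done here with a regex over the HTML string rather than by parsing into a DOM tree, so it stays self-contained. `hoistTrailingBreaks` is a hypothetical helper, not part of turndown, and it assumes the inline tags of interest are strong, b, em and i:

```javascript
// Hypothetical pre-processing helper; not part of turndown.
// Moves <br> tags (and surrounding whitespace) that sit directly before a
// closing inline tag to just after that closing tag.
function hoistTrailingBreaks(html) {
  const trailingBreaks = /((?:\s*<br\s*\/?>)+\s*)<\/(strong|b|em|i)>/gi;
  return html.replace(trailingBreaks, '</$2>$1');
}

console.log(hoistTrailingBreaks('<strong>a<br></strong>b')); // <strong>a</strong><br>b
```

The result of such a pass would then be given to turndown in place of the original string.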

domchristie avatar Jan 25 '19 10:01 domchristie

It has come up multiple times now that people want/need their HTML fixed before conversion. It may be useful to recommend a different utility that shakes out and properly rearranges HTML tag nesting, and provide a way to attach it to this one?

spirograph avatar Feb 15 '19 14:02 spirograph

It may be useful to recommend a different utility that shakes out and properly rearranges HTML tag nesting

@spirograph 👍 do you know of any libraries that will do this?

domchristie avatar Feb 24 '19 20:02 domchristie