html-to-markdown
🐛 Bug <br> is converted into two new lines (\n\n)
Describe the bug
In my testing, I've found that the HTML tag <br /> gets turned into two newlines (\n\n).
Example:
(⎈ |local:default) prologic@Jamess-iMac Mon Aug 02 11:37:55 ~/tmp/html2md (master) 130
$ ./html2md -i
Hello<br />World
Hello

World
HTML Input
Hello<br />World
Generated Markdown
Hello

World
Expected Markdown
Hello
World
Additional context
Is there any way to control this behaviour? I get that this might be getting interpreted as a "paragraph", but I would only expect that if there are two <br />(s) or an actual paragraph <p>...</p>. Thanks!
This is expected behavior. A line break in Markdown requires two newline characters. A single newline character will not render as a line break; instead, it will render as a space.
According to this page (https://www.markdownguide.org/basic-syntax) a newline in markdown shall be formatted as follows: To create a line break or new line (<br>), end a line with two or more spaces, and then type return.
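As a rough illustration of the "two or more spaces, then return" form, here is a minimal, standalone Go sketch. The helper name `hardBreak` and the regexp-based matching are assumptions made for this example only; this is not how html-to-markdown handles `<br>` internally.

```go
package main

import (
	"fmt"
	"regexp"
)

// brTag matches <br>, <br/> and <br /> case-insensitively.
// A regexp is only adequate for toy input; real HTML needs a parser.
var brTag = regexp.MustCompile(`(?i)<br\s*/?>`)

// hardBreak is a hypothetical helper: it rewrites each <br> as a Markdown
// hard line break, i.e. two trailing spaces followed by a newline.
func hardBreak(html string) string {
	return brTag.ReplaceAllString(html, "  \n")
}

func main() {
	fmt.Printf("%q\n", hardBreak("Hello<br />World")) // prints "Hello  \nWorld"
}
```

The trailing spaces are easy to lose in editors that trim whitespace, which may be one reason converters avoid this form.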
I have also seen implementations where <br> and <p></p> are converted to one and two newlines (as prologic recommends).
I don't know if there is a real standard for this. However, <br> must be treated differently from <p></p> so as not to lose information when converting from HTML to Markdown.
Take this HTML as the input:
<p>Line 1<br />Line 2</p>
With html-to-markdown and the normal CommonMark behaviour for "br" (two newlines), we get:
Line 1

Line 2
With Commonmark (see playground) this renders as:
<p>Line 1</p>
<p>Line 2</p>
If you add a custom rule for "br" that just returns a single newline with:
return String("\n")
You get this output:
Line 1
Line 2
With Commonmark (see playground) this renders as:
<p>Line 1
Line 2</p>
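The difference between the two outputs comes down to how a renderer finds paragraph boundaries. The following Go sketch is a deliberate simplification of that rule (paragraphs are separated by blank lines; a single newline stays inside the paragraph). The helper `paragraphs` is hypothetical and ignores every other Markdown construct.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// blankLine matches a newline, optional whitespace, and another newline,
// i.e. at least one blank line between two blocks of text.
var blankLine = regexp.MustCompile(`\n\s*\n`)

// paragraphs is a hypothetical helper that splits Markdown text into
// paragraphs on blank lines. It is a simplification, not a Markdown parser.
func paragraphs(markdown string) []string {
	var out []string
	for _, p := range blankLine.Split(strings.TrimSpace(markdown), -1) {
		if p != "" {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	fmt.Println(len(paragraphs("Line 1\n\nLine 2"))) // two newlines: prints 2
	fmt.Println(len(paragraphs("Line 1\nLine 2")))   // single newline: prints 1
}
```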
If we compare the different implementations (see babelmark) this behaviour is mostly shared between implementations.
The markdown rendering on github.com works differently, however 🤷‍♂️
If we want to be extra precise, the html-to-markdown library would also need to support hard line breaks. However, that would require some other changes.
So for now, the current behaviour is going to stay as it is. Changing it would break the output for other implementations. However, you are free to change the behaviour by writing a very simple custom rule.
The markdown rendering on github.com works differently however
Then can we have the GitHub-flavored markdown to use single line breaks please?
(without the need of hard line breaks, as the GitHub-flavored markdown is supposed to be tailored towards github.com)
And the change would be minimal, I'd presume, i.e. instead of outputting \n\n, do the following:
output "\n"
if (not in the GitHub-flavored markdown mode) output "\n"
Thanks
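The proposal above can be sketched in Go roughly as follows. Both the function name `brReplacement` and the `gfm` flag are hypothetical; nothing in this thread confirms that the library exposes such an option.

```go
package main

import "fmt"

// brReplacement is a hypothetical helper that returns the text emitted
// for a <br> tag. The gfm flag is an assumed setting, not a real option.
func brReplacement(gfm bool) string {
	out := "\n"
	if !gfm {
		// Outside the assumed GFM mode, keep the current two-newline output.
		out += "\n"
	}
	return out
}

func main() {
	fmt.Printf("%q\n", "Hello"+brReplacement(true)+"World")  // prints "Hello\nWorld"
	fmt.Printf("%q\n", "Hello"+brReplacement(false)+"World") // prints "Hello\n\nWorld"
}
```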
Then can we have the GitHub-flavored markdown to use single line breaks please?
There are other renderers — like the GitHub Flavored Markdown Extension from goldmark — that also implement the spec. And I don't want to break those.
Right now, it seems like it's only github.com that causes the problem...
What about an additional built-in rule for these line breaks? @suntong seems to be against the idea of altering the behavior of this project's GFM plugin or adding a new parameter to accomplish this.
@suntong I doubt you want a PR for this, but: https://github.com/ImportTaste/html2md/commit/082a6fb51863893a955aa3d59bf241224c48fe0b
Works well for me. I really don't think @JohannesKaufmann is going to budge.
NP, I'd love to, since it works well for you, and also because I agree with you that such a feature might never be accepted here. So, send the PR pls.