python-markdownify
Newlines not collapsed from HTML
After 97c78ef55b7a5be1d3782d393f3ccfbee1056671 was merged, the newlines in the parsed HTML are no longer collapsed into normal spaces, resulting in erroneous line breaks in the output.
import markdownify
print(repr(markdownify.markdownify("""\
continuous
line of
text
""")))
In 0.6.1 the above code outputs 'continuous line of text', as it would appear when rendered in a browser,
while in 0.6.3 it preserves the newlines and outputs 'continuous\nline of\ntext'.
This causes issues when the HTML is wrapped to some line length, or when line breaks are used to separate tags, for example:
text before link
<a href="link">link text</a>
continued text
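As a rough illustration of the collapsing a browser applies to inline text, here is a minimal sketch using only the standard library. The helper name `collapse_whitespace` is hypothetical and not part of markdownify, and the sketch assumes the input contains no <pre> blocks, where whitespace is significant.

```python
import re


def collapse_whitespace(html: str) -> str:
    """Collapse each run of spaces, tabs and newlines into a single space."""
    # Assumption: no <pre> elements are present, so all whitespace is collapsible.
    return re.sub(r"[ \t\r\n]+", " ", html).strip()


wrapped = 'text before link\n<a href="link">link text</a>\ncontinued text\n'
print(repr(collapse_whitespace(wrapped)))
# → 'text before link <a href="link">link text</a> continued text'
```

Running this over the HTML before conversion is only a workaround for input that is known to be plain inline markup.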
Hi! When rendered with markdown
continuous
line of
text
could render as a single line; see https://www.markdownguide.org/basic-syntax#line-breaks and https://github.github.com/gfm/#soft-line-breaks. It is up to the Markdown parser to handle it the way it wants:
A conforming parser may render a soft line break in HTML either as a line break or as a space.
A quick test here shows that the GitHub renderer decides that it should be a hard linebreak:
continuous line of text
I am not sure if we are up to spec here, but it seems like we are. I'm open to all feedback on this issue!
Best, Alex
I'm not very familiar with the spec here; maybe a switch to trigger the line break behaviour would be most fitting. I noticed this issue when parsing autogenerated HTML from the Python docs. For example, in the description tag in https://docs.python.org/3/library/stdtypes.html#str there are newlines in the strings, which results in quite a few additional (and unnecessary) newlines with the new handling.
I'll look into that. On a related note, did you try the source of the generated docs? It's rst, which could be easily converted to Markdown using pandoc or something similar: https://github.com/python/cpython/blob/master/Doc/library/stdtypes.rst Right now you are converting HTML that was itself generated from rst into Markdown.
Related is how headings and paragraphs are handled.
Example 1
md("<h2>Some Heading</h2>\n<p>Some text</p>", heading_style='ATX')
Expected:
## Some Heading\n\nSome text\n\n
Actual:
## Some Heading\n\n\nSome text\n\n
Example 2
md("<p>Paragraph 1</p>\n<p>Paragraph 2</p>")
Expected:
Paragraph 1\n\nParagraph 2\n\n
Actual:
Paragraph 1\n\n\nParagraph 2\n\n
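Until the spacing between block elements is settled, one possible workaround is to squeeze surplus blank lines out of the converted text. This is only a sketch: `squeeze_blank_lines` is a hypothetical helper, not a markdownify function, and it would also shorten deliberate runs of blank lines inside code blocks.

```python
import re


def squeeze_blank_lines(markdown: str) -> str:
    """Replace three or more consecutive newlines with exactly two."""
    return re.sub(r"\n{3,}", "\n\n", markdown)


# The "Actual" strings from Example 1 and Example 2 above
print(repr(squeeze_blank_lines("## Some Heading\n\n\nSome text\n\n")))
# → '## Some Heading\n\nSome text\n\n'
print(repr(squeeze_blank_lines("Paragraph 1\n\n\nParagraph 2\n\n")))
# → 'Paragraph 1\n\nParagraph 2\n\n'
```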
I ran into similar issues and now use it like this in my local Feediverse clone:
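import re
from markdownify import markdownify

def cleanup(text):
    # Normalise CR and CRLF line endings to LF, and trim spaces around newlines
    text = re.sub('\r+\n?', '\n', text)
    text = re.sub(' *\n *', '\n', text)
    # '\1' is the control character \x01, used as a temporary stand-in for newlines
    text = text.replace('\n', '\1')
    # Three or more newlines become a paragraph break; any remaining run becomes a space
    text = re.sub('\1\1\1+', '\n\n', text)
    text = re.sub('\1+ *', ' ', text).strip()
    text = markdownify(text, strip=['img']).strip()
    # Tidy up whitespace around blank lines in the Markdown output
    text = re.sub(' \n \n', '\n\n', text)
    text = re.sub(' *\n\n+', '\n\n', text)
    return text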
This somewhat normalises newlines in the input before handing it to markdownify (assuming no <pre> tags are present), then post-processes the output to fix more whitespace issues. It assumes a renderer that creates a hard line break for each newline within a paragraph, which is natural for most comment forms and also for Fediverse clients, as it allows one to mostly post naturally.
Update: the snippet above breaks whitespace in pre tags, though. I have a more complex wrapper around Markdownify now; I guess I’ll have to put my patched Feediverse online some day.
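For readers who hit the same <pre> problem, one way to keep preformatted blocks intact is to normalise whitespace only outside them. The following is a sketch under stated assumptions, not the wrapper mentioned in the update: `PRE_BLOCK` and `normalise_outside_pre` are hypothetical names, and the regex assumes <pre> elements are not nested and are well-formed.

```python
import re

# Hypothetical pattern: matches a whole <pre>...</pre> element, case-insensitively.
# Assumption: <pre> elements are not nested.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre\s*>)", re.IGNORECASE | re.DOTALL)


def normalise_outside_pre(html: str) -> str:
    """Collapse whitespace runs to single spaces, leaving <pre> blocks untouched."""
    parts = PRE_BLOCK.split(html)
    # With one capturing group, re.split puts ordinary text at even indices
    # and the matched <pre> blocks at odd indices.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"\s+", " ", parts[i])
    return "".join(parts)


sample = "<p>one\ntwo</p>\n<pre>a\n  b\n</pre>\n<p>three\nfour</p>"
print(repr(normalise_outside_pre(sample)))
# → '<p>one two</p> <pre>a\n  b\n</pre> <p>three four</p>'
```

The result could then be passed to markdownify in place of the raw HTML; a real HTML parser would be more robust than a regex for untrusted input.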