python-markdownify Newlines not collapsed from HTML


After 97c78ef55b7a5be1d3782d393f3ccfbee1056671 was merged, the newlines in the parsed HTML are no longer collapsed into normal spaces, resulting in erroneous line breaks in the output.

import markdownify

print(repr(markdownify.markdownify("""\
continuous
line of
text
""")))

In 0.6.1 the above code outputs 'continuous line of text', as it would look when rendered in a browser, while in 0.6.3 it preserves the newlines and outputs 'continuous\nline of\ntext'. This causes issues when the HTML is wrapped to some length or line breaks are used to separate tags, for example:

text before link
<a href="link">link text</a>
continued text
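
For reference, the collapsing a browser applies to inline text like this can be approximated as below. This is only a minimal sketch using the standard library: `collapse_whitespace` is a hypothetical helper name, not part of markdownify, it also trims leading and trailing whitespace, and it ignores whitespace-sensitive elements such as <pre>.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Approximate HTML whitespace collapsing for inline text."""
    # Any run of spaces, tabs, and line breaks becomes a single space;
    # leading and trailing whitespace is dropped for simplicity.
    return re.sub(r"[ \t\r\n\f]+", " ", text).strip()


html = 'text before link\n<a href="link">link text</a>\ncontinued text\n'
print(repr(collapse_whitespace(html)))
```

Applied before conversion, this would keep the link on the same line as the surrounding text.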

Jan 24 '21 14:01 Numerlor

Hi! When rendered as Markdown, the text

continuous
line of
text

could render as a single line; see https://www.markdownguide.org/basic-syntax#line-breaks and https://github.github.com/gfm/#soft-line-breaks. It is up to the Markdown parser to handle it the way it wants:

A conforming parser may render a soft line break in HTML either as a line break or as a space.

A quick test here shows that the GitHub renderer decides that it should be a hard linebreak:

continuous
line of
text

I am not sure if we are up to spec here, but it seems like we are. I'm open to all feedback on this issue!

Best, Alex

Jan 30 '21 09:01 AlexVonB

I'm not very familiar with the spec here; maybe a switch to trigger the line break behaviour would be most fitting. I noticed this issue when parsing autogenerated HTML from the Python docs. For example, in the description tag in https://docs.python.org/3/library/stdtypes.html#str there are newlines in the strings, which results in quite a few additional (and unnecessary) newlines with the new handling.
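
To illustrate what such a switch could look like from the caller's side, here is a rough sketch. Everything in it is an assumption for illustration: `convert`, `to_markdown`, and `collapse_newlines` are hypothetical names, and markdownify itself is not known to offer an option like this.

```python
import re
from typing import Callable


def convert(
    html: str,
    to_markdown: Callable[[str], str],
    collapse_newlines: bool = True,
) -> str:
    """Hypothetical wrapper: optionally collapse newlines before converting."""
    if collapse_newlines:
        # Replace each newline, plus any whitespace around it, with one space.
        html = re.sub(r"\s*\n\s*", " ", html)
    # With the switch off, the input reaches the converter unchanged.
    return to_markdown(html)
```

Passing `markdownify.markdownify` as `to_markdown` would apply the switch to the real converter. Like any plain regex pass, this would also flatten newlines inside <pre> elements.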

Feb 02 '21 20:02 Numerlor

I'll look into that. On a related note, did you try the source of the generated docs? It's rst, which could easily be converted to Markdown using pandoc or something similar: https://github.com/python/cpython/blob/master/Doc/library/stdtypes.rst Right now you are converting rst to HTML and then to Markdown.

Feb 07 '21 19:02 AlexVonB

Related is how headings and paragraphs are handled.

Example 1

from markdownify import markdownify as md
md("<h2>Some Heading</h2>\n<p>Some text</p>", heading_style='ATX')

Expected:

## Some Heading\n\nSome text\n\n

Actual:

## Some Heading\n\n\nSome text\n\n

Example 2

md("<p>Paragraph 1</p>\n<p>Paragraph 2</p>")

Expected:

Paragraph 1\n\nParagraph 2\n\n

Actual:

Paragraph 1\n\n\nParagraph 2\n\n
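
Until the converter handles this itself, the extra newline can be removed after the fact. The following is a small sketch rather than a proper fix: `squeeze_blank_lines` is a hypothetical helper, and it assumes the output contains no code blocks whose blank lines need to be preserved.

```python
import re


def squeeze_blank_lines(markdown: str) -> str:
    """Collapse runs of three or more newlines into exactly two."""
    # Between block elements, several blank lines render the same as one.
    return re.sub(r"\n{3,}", "\n\n", markdown)


print(repr(squeeze_blank_lines("## Some Heading\n\n\nSome text\n\n")))
print(repr(squeeze_blank_lines("Paragraph 1\n\n\nParagraph 2\n\n")))
```

Run on the two "Actual" strings above, this yields the two "Expected" strings.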

Jun 27 '21 11:06 IlyaBizyaev

I ran into similar issues and now use it like this in my local Feediverse clone:

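import re
from markdownify import markdownify

def cleanup(text):
    # Normalise CR and CRLF line endings to LF
    text = re.sub('\r+\n?', '\n', text)
    # Drop spaces around newlines
    text = re.sub(' *\n *', '\n', text)
    # '\1' is the control character U+0001, used as a temporary newline marker
    text = text.replace('\n', '\1')
    # Three or more consecutive newlines become a paragraph break
    text = re.sub('\1\1\1+', '\n\n', text)
    # Any remaining newline runs collapse to a single space
    text = re.sub('\1+ *', ' ', text).strip()
    text = markdownify(text, strip=['img']).strip()
    # Tidy whitespace in the Markdown output
    text = re.sub('  \n  \n', '\n\n', text)
    text = re.sub(' *\n\n+', '\n\n', text)
    return text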

This somewhat normalises newlines in the input before handing it to markdownify (assuming no <pre> tags are present), then post-processes the output to fix more whitespace issues. It assumes a renderer that creates a hard line break when given a newline inside a paragraph (which is natural for most comment forms etc., but also for Fediverse clients, as it allows one to mostly post naturally).

Update: the snippet above breaks whitespace in pre tags, though. I have a more complex wrapper around Markdownify now; I guess I’ll have to put my patched Feediverse online some day.
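
One way to sidestep the <pre> problem is to collapse whitespace only outside those elements. The sketch below is not the wrapper mentioned above, just an illustration under simplifying assumptions: `collapse_outside_pre` and `PRE_BLOCK` are hypothetical names, <pre> elements are assumed not to be nested, and a regex stands in for a real HTML parser.

```python
import re

# Matches a whole <pre>...</pre> element. The capture group makes
# re.split keep the matched elements in its result list.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre>)", re.DOTALL | re.IGNORECASE)


def collapse_outside_pre(html: str) -> str:
    """Collapse whitespace runs to one space, leaving <pre> content intact."""
    parts = PRE_BLOCK.split(html)
    # With a single capture group, even indices hold the text between
    # <pre> elements and odd indices hold the <pre> elements themselves.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"[ \t\r\n]+", " ", parts[i])
    return "".join(parts)


print(repr(collapse_outside_pre("a\nb<pre>x\n  y</pre>c\nd")))
```

Unlike the snippet above, this does not treat runs of blank lines as paragraph breaks; it only shows how the preformatted parts can be kept out of the normalisation step.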

May 30 '23 22:05 mirabilos
