python-markdownify More carefully separate inline text from block content

Open jsm28 opened this issue 1 year ago • 2 comments

There are various cases in which inline text fails to be separated by sufficiently many newlines from adjacent block content. A paragraph needs a blank line (two newlines) separating it from prior text, as does an underlined header; an ATX header needs a single newline separating it from prior text. A list needs at least one newline separating it from prior text, and in general two, because an ordered list starting at a number other than 1 is only recognized when a blank line precedes it.

To avoid accumulating more newlines than necessary, take care when concatenating the results of converting consecutive tags to remove redundant newlines: at each join, keep only the greater of the number of newlines ending the prior text and the number starting the subsequent text.
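The joining rule can be sketched roughly as follows. This is only an illustration of the idea, not the code in the change itself; the function name `join_chunks` is hypothetical and is not part of markdownify.

```python
def join_chunks(chunks):
    """Concatenate converted fragments, collapsing redundant newlines at each join.

    Hypothetical helper for illustration only; not markdownify's API.
    """
    result = ''
    for chunk in chunks:
        if not chunk:
            continue
        # Count the newlines ending the text so far and starting the next chunk.
        trailing = len(result) - len(result.rstrip('\n'))
        leading = len(chunk) - len(chunk.lstrip('\n'))
        # Keep the greater of the two counts instead of their sum.
        keep = max(trailing, leading)
        result = result.rstrip('\n') + '\n' * keep + chunk.lstrip('\n')
    return result


joined = join_chunks(['abc', '\n\n### Hello\n\n', '\n\nparagraph\n\n'])
print(repr(joined))
```

Here the header ends with two newlines and the paragraph starts with two, so the join keeps two rather than four.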

This is thus an alternative to #108 that tries to avoid the excess newline accumulation that was a concern there, as well as fixing more cases than just paragraphs, and updating tests.

Fixes #92

Fixes #98

Apr 09 '24 16:04 jsm28

I'm not sure that all of the suggested changes are good.

For instance, currently:

md('<h3>\n\nHello</h3>') == '### Hello\n\n'

New behavior with this MR:

md('<h3>\n\nHello</h3>') == '\n### Hello\n\n'

I don't think that this is a good approach. Those "\n" values should be removed. There is no reason to preserve them in such a case.

However, the situation should be different if something precedes the <h3> tag:

md('abc<h3>Hello</h3>') == 'abc\n\n### Hello\n\n'

Please note that the leading "\n" here is not significant.

May 27 '24 17:05 alexei-osipov

If you want to avoid leading newlines at the start of the overall output, that could be done in convert_soup.

May 27 '24 17:05 jsm28
