python-markdownify Inconsistent handling of No-Break Space and Space

I noticed inconsistent handling of No-Break Space and Space. Observed behaviour with version 0.14.1: the following HTML:

<div>this is a <i>test&nbsp;</i>with whitespaces</div>
<div>this is a <i>test </i>with whitespaces</div>

Gets converted to:

this is a *test*with whitespaces
this is a *test* with whitespaces

Note that the &nbsp; in the first line gets lost during conversion.

I would expect the conversion to produce the same result for both lines. In fact, I would expect the following result:

this is a *test *with whitespaces

Jan 01 '25 21:01 matiasfrndz

@matiasfrndz - This is indeed a bug. Thanks for reporting it!

There is a heuristic (implemented by the chomp() function) to move leading/trailing whitespace from inside style/link tags to outside, collapsing them to a single space in the process. For example, 1<code> 2 </code>3 becomes 1 <code>2</code> 3.

This bug occurs because the move-the-space-outside test considers only literal space characters, while the strip() call to remove whitespace inside also considers Unicode space characters.
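The mismatch can be reproduced with a small stand-alone sketch. The function below is a simplified illustration written from the description above, not the library's actual code, and the name chomp_sketch is hypothetical:

```python
def chomp_sketch(text):
    """Simplified illustration of the mismatch; not markdownify's real chomp()."""
    # The "move the space outside" test looks only at a literal U+0020 space...
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    # ...while strip() with no arguments removes every Unicode whitespace character.
    return prefix, text.strip(), suffix


print(chomp_sketch("test "))       # ('', 'test', ' ')  trailing space is moved outside
print(chomp_sketch("test\u00a0"))  # ('', 'test', '')   trailing no-break space is dropped
```

With a plain space the suffix survives, but with a no-break space it is stripped from the inside without being re-emitted on the outside, which matches the output in the report.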

Unicode actually offers many space characters:

\x20   Space
\xa0   No-Break Space
\u1680 Ogham Space Mark
\u2000 En Quad
\u2001 Em Quad
\u2002 En Space
\u2003 Em Space
\u2004 Three-Per-Em Space
\u2005 Four-Per-Em Space
\u2006 Six-Per-Em Space
\u2007 Figure Space
\u2008 Punctuation Space
\u2009 Thin Space
\u200a Hair Space
\u202f Narrow No-Break Space
\u205f Medium Mathematical Space
\u3000 Ideographic Space

and strip() will remove all of them.
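A quick way to check this claim for the characters listed above is the following sketch, which uses only the standard library. The name SPACE_CODE_POINTS is just for illustration:

```python
import unicodedata

# The code points from the list above: U+0020, U+00A0, U+1680,
# U+2000 through U+200A, U+202F, U+205F and U+3000.
SPACE_CODE_POINTS = [0x20, 0xA0, 0x1680, *range(0x2000, 0x200B), 0x202F, 0x205F, 0x3000]

for cp in SPACE_CODE_POINTS:
    ch = chr(cp)
    # strip() with no arguments should remove the trailing character,
    # leaving only the "x".
    removed = ("x" + ch).strip() == "x"
    print(f"U+{cp:04X}  {unicodedata.name(ch):<26} isspace={ch.isspace()}  stripped={removed}")
```

Every row should report both isspace=True and stripped=True.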

Do you absolutely require non-breaking spaces to be left undisturbed, or can the fix convert any sequence of Unicode spaces inside the tag to a single standard space character outside the tag, e.g.:

this is a *test* with whitespaces

Jan 02 '25 18:01 chrispy-snps

We could regex replace ^\s+ and \s+$, capture these as groups and use them as pre/suffix instead of using strip. \s considers all kinds of whitespace, see https://docs.python.org/3/library/re.html

Jan 02 '25 23:01 AlexVonB

@AlexVonB - I was also thinking along those lines, something like this:

prefix, text, suffix = re.match(r"^(\s*)(.*?)(\s*)$", text, flags=re.DOTALL).groups()

but then the question is, do we keep the collapsing behavior, or do we just shove the leading/trailing whitespace outside the tag exactly as-is, even if it is multiple characters or a mix of space-character types?
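To make the two options concrete, here is a minimal sketch built around the regex above. The helper names split_outer_whitespace and emphasize are hypothetical and exist only for this illustration; they are not part of markdownify:

```python
import re


def split_outer_whitespace(text):
    """Hypothetical helper: return (leading whitespace, core, trailing whitespace)."""
    return re.match(r"^(\s*)(.*?)(\s*)$", text, flags=re.DOTALL).groups()


def emphasize(text, collapse):
    """Hypothetical helper: wrap the core in *...* and move outer whitespace outside."""
    prefix, core, suffix = split_outer_whitespace(text)
    if collapse:
        # Option 1: fold any run of whitespace into a single plain space.
        prefix = " " if prefix else ""
        suffix = " " if suffix else ""
    if not core:
        # Nothing to emphasize; emit only the whitespace.
        return prefix + suffix
    # Option 2 (collapse=False): the captured whitespace is emitted exactly as-is.
    return f"{prefix}*{core}*{suffix}"


print(repr(emphasize("test\u00a0", collapse=True)))   # '*test* '
print(repr(emphasize("test\u00a0", collapse=False)))  # '*test*\xa0'
```

In Python 3, \s in a str pattern matches Unicode whitespace, so the no-break space is captured in the suffix group in both cases; the only difference is whether it is kept or replaced by a plain space.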

Jan 03 '25 00:01 chrispy-snps

@AlexVonB - after #162 is merged, I was thinking of cleaning up the string-joining code in process_tag() to improve whitespace cleanup when joining inline strings and newline cleanup when joining block strings.

For this issue, maybe we update chomp() to move the regex-captured leading/trailing whitespace as-is to outside the string. Then when I work on process_tag(), I'll try to optimize the whitespace there.

What do you think?

Jan 03 '25 02:01 chrispy-snps

I think this issue, #155 and #95 should all be dealt with together. #95 involves <br>, which gets converted to significant whitespace, illustrating that at this stage it is indeed sometimes necessary to preserve the specific whitespace (space-space-newline from <br>, or backslash followed by newline if that style of handling <br> is chosen) rather than folding it all to spaces. Another example beyond those found in those issues is foo<b>\nbar\n</b>baz, which is currently converted to foo**bar**baz, having lost the whitespace between words.

(Of course, when the tag only contains whitespace, it should only be extracted once to the parent level, not duplicated as both a prefix and a suffix.)
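The whitespace-only case is easy to get wrong. The sketch below contrasts a naive lstrip()/rstrip() split, which reports the same whitespace twice, with the regex proposed earlier in the thread, whose greedy first group takes all of the whitespace and leaves the suffix empty. Both function names are hypothetical:

```python
import re


def outer_whitespace_naive(text):
    """Hypothetical naive split: measures each side independently."""
    leading = text[: len(text) - len(text.lstrip())]
    trailing = text[len(text.rstrip()):]
    return leading, text.strip(), trailing


def outer_whitespace_regex(text):
    """Split using the regex proposed earlier in this thread."""
    return re.match(r"^(\s*)(.*?)(\s*)$", text, flags=re.DOTALL).groups()


only_ws = " \u00a0 "
print(outer_whitespace_naive(only_ws))  # (' \xa0 ', '', ' \xa0 ')  whitespace duplicated
print(outer_whitespace_regex(only_ws))  # (' \xa0 ', '', '')        whitespace extracted once
```

For text that has a non-whitespace core, both approaches give the same three parts.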

Jan 03 '25 17:01 jsm28

@jsm28 - thanks for pulling #95 into this. I think you're spot-on in how to proceed. Speaking for myself, I want to see #162 land first, then see if the solution presents itself to me. Of course, others are welcome to jump into this too. :)

Jan 03 '25 17:01 chrispy-snps

I wanted to document a potentially useful test case. This is related to whitespace handling, but I'm not sure exactly which issue to attach it to. The following HTML:

<p><b>text1<br></b>text2</p>

converts to:

**text1**text2

This is unexpected, at least for me. The following, which changes the order of the closing b and br tags:

<p><b>text1</b><br>text2</p>

converts to:

**text1**  \ntext2

This is expected. While slightly "ugly" I believe the first example is still valid HTML and should produce something roughly equivalent to the second example.

Apr 15 '25 15:04 sbrown61
