python-markdownify

Inconsistent handling of No-Break Space and Space

Open matiasfrndz opened this issue 11 months ago • 6 comments

I noticed inconsistent handling of No-Break Space and Space. Observed behaviour using version 0.14.1: the following HTML:

<div>this is a <i>test&nbsp;</i>with whitespaces</div>
<div>this is a <i>test </i>with whitespaces</div>

Gets converted to:

this is a *test*with whitespaces
this is a *test* with whitespaces

Note that the &nbsp; in the first line gets lost during conversion.

I would expect the conversion to yield the same result for both lines. In fact, I would expect the following result:

this is a *test *with whitespaces

matiasfrndz, Jan 01 '25 21:01

@matiasfrndz - This is indeed a bug. Thanks for reporting it!

There is a heuristic (implemented by the chomp() function) to move leading/trailing whitespace from inside style/link tags to outside, collapsing them to a single space in the process. For example, 1<code> 2 </code>3 becomes 1 <code>2</code> 3.

This bug occurs because the move-the-space-outside test considers only literal space characters, while the strip() call to remove whitespace inside also considers Unicode space characters.
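To make the mismatch concrete, here is a minimal sketch of the pattern described above. The function name chomp_sketch is hypothetical and the body is a simplification written for illustration, not a copy of the library's chomp(). It shows only how a literal-space test combined with an argument-less strip() loses a trailing No-Break Space.

```python
def chomp_sketch(text):
    """Simplified illustration (not the library's code) of the mismatch."""
    # The move-outside test looks only at the literal U+0020 space.
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    # str.strip() with no arguments removes every Unicode whitespace character,
    # so a no-break space is stripped without ever being moved outside.
    return prefix, suffix, text.strip()


print(chomp_sketch("test "))     # ('', ' ', 'test')  -> space is moved outside
print(chomp_sketch("test\xa0"))  # ('', '', 'test')   -> no-break space is lost
```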

Unicode actually offers many space characters:

\x20   Space
\xa0   No-Break Space
\u1680 Ogham Space Mark
\u2000 En Quad
\u2001 Em Quad
\u2002 En Space
\u2003 Em Space
\u2004 Three-Per-Em Space
\u2005 Four-Per-Em Space
\u2006 Six-Per-Em Space
\u2007 Figure Space
\u2008 Punctuation Space
\u2009 Thin Space
\u200a Hair Space
\u202f Narrow No-Break Space
\u205f Medium Mathematical Space
\u3000 Ideographic Space

and strip() will remove all of them.
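A quick way to check this claim is to run strip() against each character in the list. The snippet below is only a verification sketch; the name UNICODE_SPACES is made up for this example.

```python
# The 17 space characters listed above, written as Python string escapes.
UNICODE_SPACES = [
    "\x20", "\xa0", "\u1680",
    "\u2000", "\u2001", "\u2002", "\u2003", "\u2004", "\u2005",
    "\u2006", "\u2007", "\u2008", "\u2009", "\u200a",
    "\u202f", "\u205f", "\u3000",
]

# Each character counts as whitespace for str.isspace() ...
all_whitespace = all(ch.isspace() for ch in UNICODE_SPACES)

# ... and an argument-less strip() removes it from both ends of a string.
all_stripped = all((ch + "x" + ch).strip() == "x" for ch in UNICODE_SPACES)

print(all_whitespace, all_stripped)
```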

Do you absolutely require non-breaking spaces to be left undisturbed, or can the fix convert any sequence of Unicode spaces inside the tag to a single standard space character outside the tag, e.g.:

this is a *test* with whitespaces

chrispy-snps, Jan 02 '25 18:01

We could match ^\s+ and \s+$ with a regex, capture them as groups, and use them as prefix/suffix instead of calling strip(). For str patterns, \s matches all kinds of Unicode whitespace; see https://docs.python.org/3/library/re.html

AlexVonB, Jan 02 '25 23:01

@AlexVonB - I was also thinking along those lines, something like this:

prefix, text, suffix = re.match(r"^(\s*)(.*?)(\s*)$", text, flags=re.DOTALL).groups()

but then the question is, do we keep the collapsing behavior, or do we just shove the leading/trailing whitespace outside the tag exactly as-is, even if it is multiple characters or a mix of space-character types?
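The two options in that question can be compared side by side. The sketch below wraps the regex above in a hypothetical helper, split_edges, with a made-up collapse flag. It is meant only to show what each choice would produce, not to propose a final API.

```python
import re

# Same pattern as in the one-liner above: leading whitespace, body, trailing whitespace.
_EDGE_WS = re.compile(r"^(\s*)(.*?)(\s*)$", flags=re.DOTALL)


def split_edges(text, collapse=False):
    """Split text into (prefix, body, suffix); helper name and flag are hypothetical."""
    prefix, body, suffix = _EDGE_WS.match(text).groups()
    if collapse:
        # Option 1: fold any run of edge whitespace into one standard space.
        prefix = " " if prefix else ""
        suffix = " " if suffix else ""
    # Option 2 (collapse=False): hand the captured whitespace back exactly as-is.
    return prefix, body, suffix


print(split_edges("test\xa0"))                 # ('', 'test', '\xa0')
print(split_edges("test\xa0", collapse=True))  # ('', 'test', ' ')
```

With collapse=False the caller still sees exactly which characters were at the edges, so any later clean-up step can decide what to do with them.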

chrispy-snps, Jan 03 '25 00:01

@AlexVonB - after #162 is merged, I was thinking of cleaning up the string-joining code in process_tag() to improve whitespace cleanup when joining inline strings and newline cleanup when joining block strings.

For this issue, maybe we update chomp() to move the regex-captured leading/trailing whitespace as-is to outside the string. Then when I work on process_tag(), I'll try to optimize the whitespace there.

What do you think?

chrispy-snps, Jan 03 '25 02:01

I think this issue, #155 and #95 should all be dealt with together. #95 involves <br>, which gets converted to significant whitespace. That illustrates that at this stage it is indeed sometimes necessary to preserve the specific whitespace (space-space-newline from <br>, or backslash followed by newline if that style of handling <br> is chosen) rather than folding it all to spaces. Another example, beyond those found in those issues, is foo<b>\nbar\n</b>baz, which is currently converted to foo**bar**baz, losing the whitespace between words.

(Of course, when the tag only contains whitespace, it should only be extracted once to the parent level, not duplicated as both a prefix and a suffix.)
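As a rough sketch of both points (keeping edge whitespace as-is, and emitting whitespace-only content once), assuming a hypothetical helper wrap_inline that is not part of the library:

```python
import re


def wrap_inline(text, marker="**"):
    """Hypothetical helper: wrap text in marker, moving edge whitespace outside."""
    prefix, body, suffix = re.match(
        r"^(\s*)(.*?)(\s*)$", text, flags=re.DOTALL
    ).groups()
    if not body:
        # Whitespace-only (or empty) content: the greedy first group has already
        # captured all of it, so return it once, with no markers and no duplicate.
        return prefix
    return f"{prefix}{marker}{body}{marker}{suffix}"


# foo<b>\nbar\n</b>baz: the newlines survive outside the emphasis markers.
print(repr("foo" + wrap_inline("\nbar\n") + "baz"))  # 'foo\n**bar**\nbaz'
# A tag holding only whitespace contributes that whitespace exactly once.
print(repr("foo" + wrap_inline(" ") + "baz"))        # 'foo baz'
```

Whether the preserved newlines should later be folded into spaces is a separate decision for the code that joins inline strings.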

jsm28, Jan 03 '25 17:01

@jsm28 - thanks for pulling #95 into this. I think you're spot-on in how to proceed. Speaking for myself, I want to see #162 land first, then see if the solution presents itself to me. Of course, others are welcome to jump into this too. :)

chrispy-snps, Jan 03 '25 17:01

I wanted to document a potentially useful test case. This is related to whitespace handling, but I'm not sure exactly which issue to attach it to. The following HTML:

<p><b>text1<br></b>text2</p>

converts to:

**text1**text2

This is unexpected, at least for me. The following, which changes the order of the closing b and br tags:

<p><b>text1</b><br>text2</p>

converts to:

**text1**  \ntext2

This is expected. While slightly "ugly", I believe the first example is still valid HTML and should produce something roughly equivalent to the second example.

sbrown61, Apr 15 '25 15:04