python-markdownify
Inconsistent handling of No-Break Space and Space
I noticed an inconsistent handling of No-Break Space and Space. Observed behaviour using version 0.14.1: the following HTML:
<div>this is a <i>test&nbsp;</i>with whitespaces</div>
<div>this is a <i>test </i>with whitespaces</div>
Gets converted to:
this is a *test*with whitespaces
this is a *test* with whitespaces
Note that the no-break space in the first line gets lost during conversion.
I would expect that the conversion holds the same result for both lines. In fact I would expect the following result:
this is a *test *with whitespaces
@matiasfrndz - This is indeed a bug. Thanks for reporting it!
There is a heuristic (implemented by the chomp() function) to move leading/trailing whitespace from inside style/link tags to outside, collapsing them to a single space in the process. For example, 1<code> 2 </code>3 becomes 1 <code>2</code> 3.
This bug occurs because the move-the-space-outside test considers only literal space characters, while the strip() call to remove whitespace inside also considers Unicode space characters.
Unicode actually offers many space characters:
\x20 Space
\xa0 No-Break Space
\u1680 Ogham Space Mark
\u2000 En Quad
\u2001 Em Quad
\u2002 En Space
\u2003 Em Space
\u2004 Three-Per-Em Space
\u2005 Four-Per-Em Space
\u2006 Six-Per-Em Space
\u2007 Figure Space
\u2008 Punctuation Space
\u2009 Thin Space
\u200a Hair Space
\u202f Narrow No-Break Space
\u205f Medium Mathematical Space
\u3000 Ideographic Space
and strip() will remove all of them.
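To make the mismatch concrete, here is a minimal sketch of the heuristic as described above. The name chomp_sketch is hypothetical and this is not the library's actual source; it only assumes that the prefix/suffix test compares against a literal space while the cleanup uses a bare strip().

```python
def chomp_sketch(text):
    """Hypothetical reconstruction of the described heuristic, not library code."""
    # The move-the-space-outside test looks only at a literal U+0020 space...
    prefix = " " if text[:1] == " " else ""
    suffix = " " if text[-1:] == " " else ""
    # ...but str.strip() with no arguments removes every Unicode whitespace
    # character, including U+00A0 (No-Break Space).
    return prefix, text.strip(), suffix


for inner in ("test ", "test\xa0"):
    prefix, body, suffix = chomp_sketch(inner)
    print(repr(f"{prefix}*{body}*{suffix}"))
```

With a trailing regular space the space is moved outside the emphasis markers; with a trailing no-break space it is stripped and nothing is emitted in its place, which matches the output reported at the top of this issue.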
Do you absolutely require non-breaking spaces to be left undisturbed, or can the fix convert any sequence of Unicode spaces inside the tag to a single standard space character outside the tag, e.g.:
this is a *test* with whitespaces
We could regex replace ^\s+ and \s+$, capture these as groups and use them as pre/suffix instead of using strip. \s considers all kinds of whitespace, see https://docs.python.org/3/library/re.html
@AlexVonB - I was also thinking along those lines, something like this:
prefix, text, suffix = re.match(r"^(\s*)(.*?)(\s*)$", text, flags=re.DOTALL).groups()
but then the question is, do we keep the collapsing behavior, or do we just shove the leading/trailing whitespace outside the tag exactly as-is, even if it is multiple characters or a mix of space-character types?
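Both options from the question above can be sketched side by side. The helper name split_edges and its collapse flag are made up for illustration; only the regex comes from this thread.

```python
import re

# Regex from the comment above: leading whitespace, lazy body, trailing whitespace.
# For str patterns, \s matches Unicode whitespace such as U+00A0.
EDGE_WS = re.compile(r"^(\s*)(.*?)(\s*)$", flags=re.DOTALL)


def split_edges(text, collapse=False):
    """Hypothetical helper: split text into (prefix, body, suffix)."""
    prefix, body, suffix = EDGE_WS.match(text).groups()
    if collapse:
        # Option 1: keep the current collapsing behavior, folding any run of
        # edge whitespace into a single standard space.
        prefix = " " if prefix else ""
        suffix = " " if suffix else ""
    # Option 2 (collapse=False): hand the captured whitespace back as-is.
    return prefix, body, suffix


print(split_edges("test\xa0"))                 # as-is
print(split_edges("test\xa0", collapse=True))  # collapsed
```

The as-is variant preserves the no-break space (and any mix of space types) outside the tag, while the collapsed variant always emits a single regular space.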
@AlexVonB - after #162 is merged, I was thinking of cleaning up the string-joining code in process_tag() to improve whitespace cleanup when joining inline strings and newline cleanup when joining block strings.
For this issue, maybe we update chomp() to move the regex-captured leading/trailing whitespace as-is to outside the string. Then when I work on process_tag(), I'll try to optimize the whitespace there.
What do you think?
I think this issue, #155 and #95 should all be dealt with together. #95 involves <br>, which gets converted to significant whitespace. That illustrates that it is sometimes necessary at this stage to preserve the specific whitespace (space-space-newline from <br>, or backslash followed by newline if that style of handling <br> is chosen) rather than folding it all to spaces. Another example, beyond those found in these issues, is foo<b>\nbar\n</b>baz, which is currently converted to foo**bar**baz, losing the whitespace between words.
(Of course, when the tag only contains whitespace, it should only be extracted once to the parent level, not duplicated as both a prefix and a suffix.)
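A rough sketch of how the regex approach discussed earlier would treat these cases. The function wrap_inline and its marker parameter are hypothetical, not part of markdownify. Because the leading group is greedy, a whitespace-only string lands entirely in the prefix and the suffix comes back empty, so nothing is duplicated.

```python
import re

# Same pattern as suggested earlier in this thread.
EDGE_WS = re.compile(r"^(\s*)(.*?)(\s*)$", flags=re.DOTALL)


def wrap_inline(text, marker="**"):
    """Hypothetical helper: wrap text in marker, moving edge whitespace outside as-is."""
    prefix, inner, suffix = EDGE_WS.match(text).groups()
    if not inner:
        # Whitespace-only content: the greedy prefix group has consumed all of
        # it, so return it once and emit no markers.
        return prefix
    return f"{prefix}{marker}{inner}{marker}{suffix}"


print(repr("foo" + wrap_inline("\nbar\n") + "baz"))
print(repr("a" + wrap_inline(" \xa0") + "b"))
```

In this sketch the newlines around bar survive outside the markers, and a tag containing only whitespace contributes that whitespace exactly once.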
@jsm28 - thanks for pulling #95 into this. I think you're spot-on in how to proceed. Speaking for myself, I want to see #162 land first, then see if the solution presents itself to me. Of course, others are welcome to jump into this too. :)
I wanted to document a potentially useful test case. This is related to whitespace handling, but I'm not sure exactly which issue to attach it to. The following HTML:
<p><b>text1<br></b>text2</p>
converts to:
**text1**text2
This is unexpected, at least for me. The following, which changes the order of the closing b and br tags:
<p><b>text1</b><br>text2</p>
converts to:
**text1** \ntext2
This is expected. While slightly "ugly" I believe the first example is still valid HTML and should produce something roughly equivalent to the second example.