Extra space after a closing emphasis mark

Open ropery opened this issue 1 year ago • 1 comments

$ echo '<em>hello</em>'{\,,\",:,\[,.,\!,\?}'<br>' | html2text
_hello_ ,  
_hello_ "  
_hello_ :  
_hello_[  
_hello_.  
_hello_!  
_hello_?  

Note that in the first three lines of the output there is an extra space after the closing _ emphasis mark.

This is a bug, because Markdown has no problem with punctuation immediately following the closing emphasis mark:

$ echo _hello_{\,,\",:,\[,.,\!,\?} | markdown
<p><em>hello</em>, <em>hello</em>&ldquo; <em>hello</em>: <em>hello</em>[ <em>hello</em>. <em>hello</em>! <em>hello</em>?</p>

The same rendered by GitHub (every hello is emphasized): hello, hello" hello: hello[ hello. hello! hello?
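The observation above can be checked against the CommonMark delimiter-run rules, which is what GitHub's renderer follows. The following is a minimal sketch under that assumption; the function names are hypothetical, it is not html2text code, and the `markdown` command used in this thread may implement different rules.

```python
import unicodedata


def is_space(ch: str) -> bool:
    # The start or end of a line (empty string here) counts as whitespace.
    return ch == "" or ch.isspace()


def is_punct(ch: str) -> bool:
    # Assumption: punctuation means Unicode categories P* and S*.
    return ch != "" and unicodedata.category(ch)[0] in ("P", "S")


def left_flanking(before: str, after: str) -> bool:
    return not is_space(after) and (
        not is_punct(after) or is_space(before) or is_punct(before)
    )


def right_flanking(before: str, after: str) -> bool:
    return not is_space(before) and (
        not is_punct(before) or is_space(after) or is_punct(after)
    )


def can_close(mark: str, before: str, after: str) -> bool:
    """Can `mark` close emphasis, given the characters on either side of it?"""
    rf = right_flanking(before, after)
    if mark == "*":
        return rf
    # `_` is stricter: it must not also be left-flanking,
    # unless it is followed by punctuation.
    return rf and (not left_flanking(before, after) or is_punct(after))


# The closing `_` in `_hello_X` sits between "o" and X.
for ch in ',":[.!?':
    print(repr(ch), can_close("_", "o", ch))
```

Under these rules a closing `_` followed directly by any of the seven punctuation characters still closes the emphasis, so no padding space is needed for them. Only a following word character blocks it: `can_close("_", "o", "b")` is false, while `can_close("*", "o", "b")` is true.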

I guess the extra space is added here:

https://github.com/Alir3z4/html2text/blob/099c4b8bfeea09d640e18324bb1d44f051371940/html2text/__init__.py#L295-L297

Or here, which explains why the bottom four results don't have the extra space:

https://github.com/Alir3z4/html2text/blob/099c4b8bfeea09d640e18324bb1d44f051371940/html2text/__init__.py#L860-L868

ropery, Dec 24 '23 14:12

I would like to add that maybe we should simply not add extra spaces around emphasized text:

$ for i in _ \* __ \*\*; do echo "${i}foo${i}bar${i}baz${i}"; done
_foo_bar_baz_
*foo*bar*baz*
__foo__bar__baz__
**foo**bar**baz**

My markdown produces:

$ for i in _ \* __ \*\*; do echo "${i}foo${i}bar${i}baz${i}" | markdown; done
<p><em>foo_bar_baz</em></p>
<p><em>foo</em>bar<em>baz</em></p>
<p><strong>foo</strong>bar<strong>baz</strong></p>
<p><strong>foo</strong>bar<strong>baz</strong></p>

But GitHub's rendering disagrees for the third, __foo__bar__baz__, which it treats as a single strong span with the inner underscores kept literal: foo_bar_baz foobarbaz foo__bar__baz foobarbaz

$ for i in _ \* __ \*\*; do echo "${i}foo${i}bar${i}baz${i}" | markdown | html2text; done
_foo_bar_baz_

_foo_ bar _baz_

**foo** bar**baz**

**foo** bar**baz**

So it seems that, if we want to add extra spaces, it should be only when the emphasis mark is _ or __. The * and ** marks don't require extra spaces for Markdown to apply the emphasis, e.g., ***a**b* -> <em><strong>a</strong>b</em> = ok

This leads to the question: should -e be the default, or should html2text automatically use * wherever _ would require extra spaces (thereby irreversibly distorting the text)?
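To make the two options concrete, here is a minimal sketch of the decision being proposed. `emphasize` and its parameters are hypothetical and not part of html2text; it assumes the emphasized text itself starts and ends with a word character, and that `_` next to a word character on the outside will not be parsed as emphasis (the CommonMark behaviour GitHub shows above).

```python
def emphasize(text: str, before: str = "", after: str = "",
              mark: str = "_", fallback: str = "space") -> str:
    """Wrap `text` in emphasis marks, intervening only when `_` would fail.

    before/after: the surrounding text (may be empty).
    mark:         "_", "__", "*" or "**".
    fallback:     "space" pads with a space, "asterisk" switches to `*`.
    """
    # Slicing keeps this safe for empty strings; "".isalnum() is False.
    word_before = before[-1:].isalnum()
    word_after = after[:1].isalnum()

    # Only underscore marks touching a word character need any help.
    needs_help = mark.startswith("_") and (word_before or word_after)
    if not needs_help:
        return f"{before}{mark}{text}{mark}{after}"

    if fallback == "asterisk":
        # Same emphasis level, but with `*`, which works intraword.
        star = "*" * len(mark)
        return f"{before}{star}{text}{star}{after}"

    # Pad only the side that actually touches a word character.
    left = " " if word_before else ""
    right = " " if word_after else ""
    return f"{before}{left}{mark}{text}{mark}{right}{after}"


print(emphasize("hello", after=","))                       # _hello_,
print(emphasize("foo", after="bar"))                       # _foo_ bar
print(emphasize("foo", after="bar", fallback="asterisk"))  # *foo*bar
print(emphasize("foo", after="bar", mark="**"))            # **foo**bar
```

The "space" fallback changes the text content, while the "asterisk" fallback changes only the markup, which is the trade-off raised in the question above.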

ropery, Dec 24 '23 15:12