mf2py

retain authored HTML for empty elements

Open kartikprabhu opened this issue 7 years ago • 8 comments

Currently mf2py, because it uses BeautifulSoup, self-closes empty HTML tags, e.g. <br> gets converted to <br/> and <hr> gets converted to <hr/>. This makes the e-content[html] value differ from the authored HTML.

This does not seem to be an issue in actual use but will be for any tests. So I am documenting this here.

Details

html5lib does not do this by default; see: https://github.com/html5lib/html5lib-python/blob/5e6b61b4630165dd4765fff41d0f855534d5e2fe/html5lib/serializer.py#L114

The relevant lines in BeautifulSoup which explicitly do this are https://github.com/waylan/beautifulsoup/blob/480367ce8c8a4d1ada3012a95f0b5c2cce4cf497/bs4/element.py#L1106-L1107 (Note that this is not the canonical source for BS4)

kartikprabhu avatar Mar 05 '18 04:03 kartikprabhu

Note that “retain” is actually the wrong behaviour. You should normalise to output with no trailing solidus on void elements. The microformats parsing specification for e- says to fill the html property using HTML’s serialising algorithm, which never includes it.

Zegnat avatar Mar 08 '18 11:03 Zegnat

BS4 recently added an "html5" formatter that apparently does this: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/NEWS.txt#L14

I propose switching output to that, since we do not make any other promise about the HTML output AFAIK?
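To make the difference concrete, here is a minimal sketch comparing BeautifulSoup's default output with the "html5" formatter. It assumes a bs4 release recent enough to ship that formatter, and it uses the built-in "html.parser" backend only to keep the example small; the sample markup and variable names are illustrative and not taken from mf2py.

```python
from bs4 import BeautifulSoup

# Illustrative authored markup containing two void elements.
authored = "<p>one<br>two</p><hr>"

# "html.parser" is used here only so no <html>/<body> wrapper is added.
soup = BeautifulSoup(authored, "html.parser")

# The default ("minimal") formatter writes void elements with a trailing solidus.
default_output = soup.decode()

# The "html5" formatter writes void elements without the trailing solidus.
html5_output = soup.decode(formatter="html5")

print(default_output)  # <p>one<br/>two</p><hr/>
print(html5_output)    # <p>one<br>two</p><hr>
```

With the "html5" formatter the output matches what HTML's serialising algorithm would produce for these elements, which is the behaviour the e- parsing rules ask for.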

sknebel avatar Oct 01 '18 20:10 sknebel

@snarfed While investigating another issue I discovered today that granary appears to rely on mf2py to produce somewhat XHTML-compatible output, which this would break. We could expose the HTML formatter in the API to allow granary to force the old behavior?

EDIT: alternatively, some variation of #97 that allows granary to force the serialization downstream, which might give it more options to generate proper XML.

sknebel avatar Oct 05 '18 17:10 sknebel

thanks for the heads up! i'm not entirely sure how this would affect granary yet, but i can always update it. let me know when you have an mf2py PR you want me to test!

snarfed avatar Oct 08 '18 13:10 snarfed

#136 is said PR. Granary produces Atom feeds with the content declared to be XHTML and, unless I missed something, that content is taken straight from the mf2 parsing result when turning an mf2 HTML feed into Atom. Not closing void elements isn't allowed in XHTML as far as I know, and would make those feeds invalid?
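The concern can be checked with the standard library alone. The sketch below wraps a fragment in an XHTML div, roughly the way Atom's type="xhtml" content does, and tests whether an XML parser accepts it. The helper name is hypothetical, and well-formedness is only a necessary condition for valid XHTML (HTML-only named entities, for example, would still need separate handling).

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_well_formed_xhtml_fragment(fragment: str) -> bool:
    """Return True if the fragment parses as XML inside an XHTML <div>."""
    wrapped = f'<div xmlns="{XHTML_NS}">{fragment}</div>'
    try:
        ET.fromstring(wrapped)
    except ET.ParseError:
        return False
    return True


# Self-closed void element: acceptable to an XML parser.
print(is_well_formed_xhtml_fragment("<p>one<br/>two</p>"))  # True

# HTML-style void element: the unclosed <br> breaks well-formedness.
print(is_well_formed_xhtml_fragment("<p>one<br>two</p>"))   # False
```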

sknebel avatar Oct 08 '18 13:10 sknebel

thanks! yeah, i got that part, i just don't remember the exact transformation steps in granary. no matter, i'll try it and see.

snarfed avatar Oct 08 '18 13:10 snarfed

@sknebel you're right. thanks for thinking of granary! it does pass the HTML content pretty much straight through to Atom.

ideally yes, i'd love a flag or exposed HTML formatter in mf2py so i can control this in granary. i tested just now though, and if i change the Atom to <content type="html">, and fully escape it, HTML5 content validates ok and renders ok in readers. so i can handle whatever you all end up choosing.
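a rough sketch of that escaping approach, using only the standard library. the function name is made up for illustration and this is not granary's actual code; it just shows that fully escaped HTML5 content, unclosed void elements included, sits safely inside an Atom content element declared as type="html".

```python
from html import escape
import xml.etree.ElementTree as ET


def atom_html_content(html_fragment: str) -> str:
    """Wrap an HTML fragment as escaped text in an Atom <content type="html">."""
    # quote=False: only &, < and > need escaping inside element text.
    return f'<content type="html">{escape(html_fragment, quote=False)}</content>'


element = atom_html_content("<p>one<br>two</p>")
print(element)
# <content type="html">&lt;p&gt;one&lt;br&gt;two&lt;/p&gt;</content>

# An XML parser accepts the element and hands back the original HTML as text.
assert ET.fromstring(element).text == "<p>one<br>two</p>"
```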

snarfed avatar Oct 08 '18 14:10 snarfed

If you use html5lib you can reserialize with options turned on - I think use_trailing_solidus is the one that self-closes void elements; I don't know that it can guarantee full XHTML compliance though, which is what you need to inline them in Atom. Reparsing and serializing each HTML chunk might not be optimal, however.
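A minimal sketch of that reserialization, assuming html5lib is installed. The helper name is hypothetical; omit_optional_tags is switched off here so that end tags such as </p> are kept, and, as noted above, this does not by itself guarantee XHTML-compliant output.

```python
import html5lib


def reserialize_with_trailing_solidus(html_fragment: str) -> str:
    """Reparse an HTML fragment and serialize void elements with a trailing solidus."""
    fragment = html5lib.parseFragment(html_fragment, treebuilder="etree")
    return html5lib.serialize(
        fragment,
        tree="etree",
        use_trailing_solidus=True,   # self-close void elements such as <br>
        omit_optional_tags=False,    # keep optional end tags such as </p>
    )


result = reserialize_with_trailing_solidus("<p>one<br>two</p>")
print(result)
```

Each call reparses the chunk from scratch, which is the extra cost mentioned above.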

kevinmarks avatar Dec 19 '18 12:12 kevinmarks