mf2py
retain authored HTML for empty elements
Because mf2py uses BeautifulSoup, it currently self-closes empty (void) HTML tags: e.g. <br> is converted to <br/> and <hr> is converted to <hr/>. This makes the e-content[html] value differ from the authored markup.
This does not seem to be an issue in actual use, but it will be for any tests that compare the output against the authored HTML, so I am documenting it here.
Details
html5lib does not do this by default; see: https://github.com/html5lib/html5lib-python/blob/5e6b61b4630165dd4765fff41d0f855534d5e2fe/html5lib/serializer.py#L114
The relevant lines in BeautifulSoup that explicitly do this are https://github.com/waylan/beautifulsoup/blob/480367ce8c8a4d1ada3012a95f0b5c2cce4cf497/bs4/element.py#L1106-L1107 (note that this is not the canonical source for BS4).
Note that “retain” is actually the wrong behaviour. You should normalise to having no trailing solidus, whatever the author wrote. The microformats parsing specification for e- says to fill the html property using HTML’s serialising algorithm, which never emits a trailing solidus.
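For tests that compare e-content[html] against expected markup, the normalisation described above could be approximated like this. This is only a rough regex sketch, not mf2py's implementation and not the full HTML serialising algorithm; the function name `strip_trailing_solidus` is made up for illustration, and it assumes attribute values are quoted and contain no `<` or `>`.

```python
import re

# Void elements as listed in the HTML Standard.
VOID_ELEMENTS = (
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
)

# Matches a self-closed void tag such as <br/>, <hr /> or <img src="x"/>.
_SELF_CLOSED_VOID = re.compile(
    r"<(%s)\b([^<>]*?)\s*/>" % "|".join(VOID_ELEMENTS),
    re.IGNORECASE,
)


def strip_trailing_solidus(html):
    """Hypothetical helper: rewrite <br/> as <br>, keeping any attributes."""
    return _SELF_CLOSED_VOID.sub(r"<\1\2>", html)
```

Applying it to both the expected and the actual string before comparing would make a test indifferent to which serializer produced the markup. Non-void elements such as `<span/>` are left untouched.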
BS4 recently added an "html5" formatter that apparently does this: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/NEWS.txt#L14
I propose switching output to that, since we do not make any other promise about the HTML output AFAIK?
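A small sketch of the difference between the two formatters, assuming a Beautiful Soup release recent enough to accept the "html5" formatter name. The helper `serialize_both` is made up for this example, and `html.parser` is used only to keep the sketch free of other dependencies; it is not necessarily the parser mf2py uses.

```python
from bs4 import BeautifulSoup


def serialize_both(html):
    """Hypothetical helper: return (default output, "html5" formatter output)."""
    soup = BeautifulSoup(html, "html.parser")
    # "minimal" is Beautiful Soup's default formatter; it self-closes void tags.
    default_output = soup.decode(formatter="minimal")
    # "html5" leaves void tags without the trailing solidus.
    html5_output = soup.decode(formatter="html5")
    return default_output, html5_output


default_output, html5_output = serialize_both("<p>one<br>two</p><hr>")
print(default_output)  # <p>one<br/>two</p><hr/>
print(html5_output)    # <p>one<br>two</p><hr>
```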
@snarfed While investigating another issue I discovered today that granary appears to rely on mf2py to produce somewhat XHTML-compatible output, which this would break. We could expose the HTML formatter in the API to allow granary to force the old behavior?
EDIT: alternatively, some variation of #97 that allows granary to force the serialization downstream, which might give it more options to generate proper XML.
thanks for the heads up! i'm not entirely sure how this would affect granary yet, but i can always update it. let me know when you have an mf2py PR you want me to test!
#136 is said PR. When turning an mf2 HTML feed into Atom, granary produces feeds whose content is declared to be XHTML and, unless I missed something, is taken straight from the mf2 parsing output. Not closing void elements isn't allowed in XHTML as far as I know, and would make those feeds invalid?
thanks! yeah, i got that part, i just don't remember the exact transformation steps in granary. no matter, i'll try it and see.
@sknebel you're right. thanks for thinking of granary! it does pass the HTML content pretty much straight through to Atom.
ideally yes, i'd love a flag or exposed HTML formatter in mf2py so i can control this in granary. i tested just now though, and if i change the Atom to <content type="html">, and fully escape it, HTML5 content validates ok and renders ok in readers. so i can handle whatever you all end up choosing.
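A minimal sketch of that escaped `type="html"` approach, using only the standard library. This is not granary's code; the function name `atom_html_content` is made up for illustration, and a real feed would carry the Atom namespace on an enclosing element.

```python
import html


def atom_html_content(html_fragment):
    """Hypothetical helper: wrap an HTML fragment as escaped Atom content."""
    # With type="html" the markup travels as escaped text, so the feed stays
    # well-formed XML even when void elements such as <br> are not closed.
    escaped = html.escape(html_fragment, quote=False)
    return '<content type="html">%s</content>' % escaped


print(atom_html_content("<p>one<br>two &amp; three</p>"))
# <content type="html">&lt;p&gt;one&lt;br&gt;two &amp;amp; three&lt;/p&gt;</content>
```

Because the fragment is escaped rather than inlined, the XML parser never sees the unclosed `<br>`, which is why HTML5-style output no longer matters for feed validity in this variant.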
If you use html5lib you can reserialize with options turned on. I think use_trailing_solidus is the one that self-closes void elements; I don't know that it can guarantee full XHTML compliance though, which is what you need to inline the content in Atom. Reparsing and reserializing each HTML chunk might not be optimal either.
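Such a reserialization step might look roughly like the following, assuming html5lib 1.x. The function name `reserialize_with_solidus` is made up for this sketch, the extra options are my own choice, and none of this by itself guarantees well-formed XHTML.

```python
import html5lib


def reserialize_with_solidus(html_fragment):
    """Hypothetical helper: reparse a fragment and self-close void elements."""
    fragment = html5lib.parseFragment(html_fragment)
    return html5lib.serialize(
        fragment,
        tree="etree",
        use_trailing_solidus=True,   # emit <br /> instead of <br>
        omit_optional_tags=False,    # keep end tags such as </p>
        quote_attr_values="always",  # always quote attribute values
    )


print(reserialize_with_solidus("<p>one<br>two</p><hr>"))
```

Turning off `omit_optional_tags` matters here, since html5lib otherwise drops end tags that HTML allows to be omitted, which XML does not.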