mf2py
expose BS4 tree for sanitation
Given that for many use cases HTML is going to be sanitized after being extracted from a page (e.g. to be displayed as a comment), it could make sense to optionally return the nodes from the parsed beautifulsoup tree instead of turning them into HTML strings that'll immediately afterwards be parsed again.
Alternatively, we could add a hook to modify the tree before serialization inside mf2py.
Thoughts?
@sknebel the BS4 tree is already exposed for advanced users through the Parser.__doc__ property. You can also give a BS4 tree directly as the doc argument to parse; i.e. you can use BS4 to first sanitise your HTML and then give it to mf2py. (This is the approach I use)
For the first, you'd then have to find the e-* elements yourself and match them. The second is a possibility, although you'd have to be careful not to destroy relevant data in the process (e.g. classes would commonly be wiped by a sanitizer step).
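A minimal sketch of the second approach (sanitise with BS4 first, then hand the tree to mf2py) that leaves class attributes alone so the microformats markup survives. The names sanitize_tree, parse_sanitized and DANGEROUS_TAGS are hypothetical, and the blocklist is only for illustration; a real deployment would more likely use an allowlist-based sanitizer. Passing the tree via doc= follows the description earlier in this thread.

```python
from bs4 import BeautifulSoup

# Hypothetical blocklist, for illustration only (not a complete sanitizer).
DANGEROUS_TAGS = ["script", "style", "iframe"]


def sanitize_tree(soup):
    """Strip dangerous tags and on* attributes in place; keep class attributes."""
    for tag in soup.find_all(DANGEROUS_TAGS):
        tag.decompose()
    for tag in soup.find_all(True):
        for attr in list(tag.attrs):
            # Drop inline event handlers such as onclick; leave class untouched
            # so that h-*, p-*, u-* and e-* markers are still present.
            if attr.lower().startswith("on"):
                del tag[attr]
    return soup


def parse_sanitized(html, url=None):
    """Sanitise first, then give the BS4 tree to mf2py as the doc argument."""
    import mf2py  # imported here so sanitize_tree can be used without mf2py

    soup = sanitize_tree(BeautifulSoup(html, "html.parser"))
    return mf2py.parse(doc=soup, url=url)
```

Because the sanitizer never touches class, the e-content HTML that mf2py returns is serialized from the already-cleaned tree.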
@sknebel Do you have a specific use case in mind in which this would be helpful? Is this still something that would be helpful to you as a consumer of microformats?
as mentioned, comment display. (really, any scenario that will display the extracted HTML will need sanitizing and thus can make use of this)
@sknebel do you mean something like this:
<div class=h-entry>
  in reply to: <a class=u-in-reply-to href=//example.com/foo>Main thread</a>
  <div class=e-content>This is the <code>&lt;body&gt;</code> of a comment.</div>
</div>
parses to:
{
    'items': [{
        'type': ['h-entry'],
        'properties': {
            'in-reply-to': ['//example.com/foo'],
            'content': [{
                'html': 'This is the <code>&lt;body&gt;</code> of a comment.',
                'value': 'This is the <body> of a comment.',
                'dom': <bs4.element.Tag>}]}}],
    ...
}
Note the new dom key and what would be the associated BeautifulSoup4 Tag object.
Something like that, although personally, I'd leave out the html key with such a flag turned on, since it is specifically for "I want to do further things you don't know about with the HTML" use cases (and on larger documents, re-serializing the HTML probably takes noticeable amounts of time).
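To make the proposal concrete, here is a rough consumer-side sketch of what a dom key could look like, built only on BeautifulSoup. It is not mf2py's API: the function embedded_properties and the include_html flag are hypothetical, and the sketch ignores nesting and the rest of the microformats2 parsing rules. It only collects elements carrying an e-* class.

```python
from bs4 import BeautifulSoup


def embedded_properties(soup, include_html=True):
    """Collect e-* elements, returning the Tag itself under a 'dom' key.

    Hypothetical helper: with include_html=False the inner HTML is never
    serialized, which is the variant suggested for large documents.
    """
    found = []
    for tag in soup.find_all(True):
        for cls in tag.get("class", []):
            if not cls.startswith("e-"):
                continue
            entry = {
                "property": cls[2:],      # 'e-content' -> 'content'
                "value": tag.get_text(),  # plain-text value
                "dom": tag,               # the live bs4.element.Tag
            }
            if include_html:
                entry["html"] = tag.decode_contents()
            found.append(entry)
    return found


html = (
    "<div class=h-entry>"
    "in reply to: <a class=u-in-reply-to href=//example.com/foo>Main thread</a>"
    "<div class=e-content>This is the <code>&lt;body&gt;</code> of a comment.</div>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
with_html = embedded_properties(soup)
without_html = embedded_properties(soup, include_html=False)
```

A caller can then run its own sanitizer over entry["dom"] and serialize once, instead of re-parsing an HTML string.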