
expose BS4 tree for sanitation

Open sknebel opened this issue 6 years ago • 2 comments

Given that for many use cases HTML is going to be sanitized after being extracted from a page (e.g. to be displayed as a comment), it could make sense to optionally return the nodes from the parsed beautifulsoup tree instead of turning them into HTML strings that'll immediately afterwards be parsed again.

Alternatively, we could add a hook to modify the tree before serialization inside mf2py.

Thoughts?

sknebel avatar Mar 10 '18 14:03 sknebel

@sknebel the BS4 tree is already exposed for advanced users through the Parser.__doc__ property. You can also pass a BS4 tree directly as the doc argument to parse, so you can use BS4 to sanitise your HTML first and then give the result to mf2py. (This is the approach I use.)

kartikprabhu avatar Mar 10 '18 15:03 kartikprabhu

For the first, you'd then have to find the e-* elements yourself and match them to the parsed output. The second is a possibility, although you'd have to be careful not to destroy relevant data in the process (e.g. classes would commonly be wiped by a sanitizer step).
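
To illustrate the class-wiping concern, here is a rough standard-library sketch of an allowlist sanitizer that drops ordinary class tokens but keeps ones that look like microformats2 classes. Everything in it (the tag and attribute allowlists, the prefix check, the names Mf2PreservingSanitizer and sanitize_keep_mf2) is an assumption for illustration, not something mf2py or any sanitizer library provides, and it is not production-grade.

```python
from html import escape
from html.parser import HTMLParser

# Illustrative allowlists; a real deployment would tune these.
ALLOWED_TAGS = {"a", "b", "code", "div", "em", "i", "p", "span", "strong"}
DROP_WITH_CONTENT = {"script", "style"}
SAFE_HREF_PREFIXES = ("http://", "https://", "//", "/")
# Simple prefix test; it will also keep non-microformats classes such as "p-4".
MF2_PREFIXES = ("h-", "p-", "u-", "dt-", "e-")


class Mf2PreservingSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self._skip += 1
            return
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself but keep its text content
        kept = []
        for name, value in attrs:
            value = value or ""
            if name == "class":
                tokens = [t for t in value.split() if t.startswith(MF2_PREFIXES)]
                if tokens:
                    kept.append(("class", " ".join(tokens)))
            elif name == "href" and value.startswith(SAFE_HREF_PREFIXES):
                kept.append((name, value))
        rendered = "".join(f' {n}="{escape(v, quote=True)}"' for n, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self._skip:
                self._skip -= 1
        elif tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))


def sanitize_keep_mf2(html):
    parser = Mf2PreservingSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<div class="e-content big" onclick="x()">Hi <script>alert(1)</script>'
    '<a class="u-url btn" href="https://example.com/">link</a></div>'
)
cleaned = sanitize_keep_mf2(sample)
```

With this input, the big and btn classes and the onclick handler are removed, while e-content and u-url survive for a later mf2py pass.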

sknebel avatar Mar 10 '18 15:03 sknebel

@sknebel Do you have a specific use case in mind in which this would be helpful? Is this still something that would be helpful to you as a consumer of microformats?

capjamesg avatar Jul 06 '23 20:07 capjamesg

As mentioned, comment display. Really, any scenario that displays the extracted HTML needs to sanitize it and can therefore make use of this.

sknebel avatar Jul 10 '23 22:07 sknebel

@sknebel do you mean something like this:

<div class=h-entry>
in reply to: <a class=u-in-reply-to href=//example.com/foo>Main thread</a>
<div class=e-content>This is the <code>&lt;body&gt;</code> of a comment.</div>
</div>

parses to:

{
  'items': [{
    'type': ['h-entry'],
    'properties': {'in-reply-to': ['//example.com/foo'],
    'content': [{
      'html': 'This is the <code>&lt;body&gt;</code> of a comment.',
      'value': 'This is the <body> of a comment.',
      'dom': <bs4.element.Tag>}]}}],
  ...
}

Note the new dom key, whose value would be the associated BeautifulSoup4 Tag object.

angelogladding avatar Jul 12 '23 02:07 angelogladding

Something like that, although personally I'd leave out the html key when such a flag is turned on, since the flag is specifically for "I want to do further things you don't know about with the HTML" use cases (and on larger documents, re-serializing the HTML probably takes a noticeable amount of time).

sknebel avatar Jul 12 '23 10:07 sknebel