Use bleach for HTML sanitizing
This will likely be a breaking change.
feedparser's HTML sanitizing should not rely on custom internal code anymore. Using an external package like bleach will allow feedparser to focus more closely on feed parsing, and allow developers to consolidate HTML sanitizing efforts so that everyone benefits.
Interestingly, the source code for Mozilla's bleach module links to the WHATWG documentation, which states that its early work was based on feedparser's HTML sanitizing, so it appears that things have come full circle in the last ~17 years!
Early testing suggests that this will change feedparser's output so that it is valid HTML5 but perhaps not valid XHTML or HTML4. For example, element attribute values may not always be quoted.
Is there already work on this feature that could be tested?
Not yet. It looks like it will take some effort and customization because bleach has extremely strict defaults.
@kurtmckee looking at the code of BaseHTMLProcessor, it seems to mix sanitisation with the loose parsing support quite a bit. Would splitting those apart (and possibly removing some of the dead code that appears to be in it) be a good first step?
Playing around, it looks like all the tests pass if all the handle_, unknown_, and convert_ methods are removed from LooseFeedParser, as XMLParserMixin tends to implement those itself and not delegate (call super()).
And then HTMLSanitizer overrides a few methods which BaseHTMLProcessor defines/overrides, either with or without delegation.
For the overrides without delegation, the corresponding method in BaseHTMLProcessor is dead code; the rest might be opportunities for simplification (or dead code as well).
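To illustrate what I mean by delegation, here is a minimal sketch using toy classes. The `Toy*` names, their methods, and the base-class order are stand-ins I made up for illustration; they are not feedparser's actual classes or hierarchy.

```python
class ToyBaseProcessor:
    """Hypothetical stand-in for a shared base class."""

    def __init__(self):
        self.calls = []

    def handle_data(self, text):
        self.calls.append("base.handle_data")

    def unknown_starttag(self, tag):
        self.calls.append("base.unknown_starttag")


class ToyParserMixin:
    """Hypothetical mixin that overrides handle_data WITHOUT calling super()."""

    def handle_data(self, text):
        self.calls.append("mixin.handle_data")


class ToyLooseParser(ToyParserMixin, ToyBaseProcessor):
    """The mixin comes first in the MRO, so its handle_data always wins."""


class ToySanitizer(ToyBaseProcessor):
    """Hypothetical subclass that overrides unknown_starttag WITH delegation."""

    def unknown_starttag(self, tag):
        self.calls.append("sanitizer.unknown_starttag")
        super().unknown_starttag(tag)


def run_demo():
    parser = ToyLooseParser()
    parser.handle_data("x")

    sanitizer = ToySanitizer()
    sanitizer.unknown_starttag("b")

    return parser.calls, sanitizer.calls
```

In this sketch `ToyBaseProcessor.handle_data` is never reached through `ToyLooseParser`, so it is dead code as far as that class is concerned, while `ToyBaseProcessor.unknown_starttag` still runs through `ToySanitizer` because the override calls `super()`.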
A heads up: you probably don't want to go down this road, as Bleach is deprecated because html5lib is not actively maintained (it's a bit circular 🙃).
CC @lemon24 as this was also discussed in #296.