Lists consisting mostly of links get removed
Expected Behavior
Postlight Parser should preserve all the actual content of the page.
Current Behavior
Postlight Parser removes any bulleted or numbered list that consists mostly of links.
Steps to Reproduce
Run Postlight Parser on https://faultlore.com/blah/defaults-affect-inference. The bulleted list shortly after the 'Some Wild Shit Swift Does' heading is missing from the output.
Picture of the list in question:
Detailed Description
This is the code that causes the problem:
https://github.com/postlight/parser/blob/e8ba7ece291efa4d915d50dd4deeec17d54359f2/src/utils/dom/clean-tags.js#L43-L73
Its purpose is to strip out navigation menus and similar link-heavy page furniture, but it also catches genuine content lists.
Possible Solution
The easiest solution would be to take the special case from the weight >= 25 branch of the code above, which keeps any list that comes directly after a paragraph ending in a colon, and apply it to the weight < 25 branch as well. (The lists that currently get removed fall into the weight < 25 branch, which is why the existing special case doesn't save them.)
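To make the proposal concrete, here is a rough sketch of the decision with the colon exemption applied to both branches. It is a simplified model on plain objects, not the parser's actual code: the `shouldRemove` and `isListAfterColon` names, the shape of `node`, and the density and length thresholds are all assumptions for illustration; only the weight cutoff of 25 and the colon rule come from the description above.

```javascript
// Hypothetical, simplified model of the link-density check.
// The thresholds below are illustrative assumptions, not values
// confirmed from the parser source.
const LOW_WEIGHT_DENSITY = 0.2;
const HIGH_WEIGHT_DENSITY = 0.5;
const LOW_WEIGHT_MIN_LENGTH = 75;

// True when the node is a list whose previous sibling's text ends in a colon.
function isListAfterColon(node) {
  const isList = node.tagName === 'ul' || node.tagName === 'ol';
  return isList && node.previousText.trim().endsWith(':');
}

// Decide whether a node should be dropped as menu-like.
// `node` is a plain object: { tagName, linkDensity, contentLength, previousText }.
function shouldRemove(node, weight) {
  const linkHeavyLowWeight =
    weight < 25 &&
    node.linkDensity > LOW_WEIGHT_DENSITY &&
    node.contentLength > LOW_WEIGHT_MIN_LENGTH;
  const linkHeavyHighWeight =
    weight >= 25 && node.linkDensity > HIGH_WEIGHT_DENSITY;

  if (linkHeavyLowWeight || linkHeavyHighWeight) {
    // The proposed change: the colon exemption now covers both branches.
    return !isListAfterColon(node);
  }
  return false;
}
```

With this shape, a low-weight `ul` that follows "Some examples:" is kept, while the same list after a paragraph ending in a full stop, or a link-heavy `div`, is still removed.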
Another solution I thought of would be to look at either the average or the maximum length of the link text in a list (or table, div, or anything else the tag-cleaning code is applied to), and keep the element if that length is above some threshold. In theory that should distinguish the short links found in menus from the longer, sentence-length links found in content. Looking at my example again, though, the links in that list are actually quite short, so this might not work as well as I'd hoped.
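For reference, the link-length idea could look something like the sketch below. Everything in it is hypothetical: the function names, the use of the average rather than the maximum, and especially the threshold of 30 characters, which is a made-up number rather than anything measured.

```javascript
// Hypothetical threshold: minimum average link-text length (in characters)
// for a link-heavy element to count as content. Not a tuned value.
const MIN_CONTENT_LINK_LENGTH = 30;

// Compute the average and maximum length of the given link texts.
function linkTextStats(linkTexts) {
  if (linkTexts.length === 0) {
    return { average: 0, max: 0 };
  }
  const lengths = linkTexts.map((text) => text.trim().length);
  const total = lengths.reduce((sum, n) => sum + n, 0);
  return { average: total / lengths.length, max: Math.max(...lengths) };
}

// Guess whether a set of links reads like content rather than a menu.
function linksLookLikeContent(linkTexts, threshold = MIN_CONTENT_LINK_LENGTH) {
  return linkTextStats(linkTexts).average >= threshold;
}
```

A menu such as `['Home', 'About', 'Contact']` falls well below the threshold, but so would a content list made of short links like the one in the page above, which is the weakness of this approach.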
So the first solution is probably the way to go. I've already implemented it at https://github.com/Liamolucko/postlight-parser/tree/fix-link-lists and confirmed that it works.