Lists consisting of mostly links get removed

Open Liamolucko opened this issue 2 years ago • 0 comments

Expected Behavior

Postlight Parser should preserve all the actual content of the page.

Current Behavior

Postlight Parser removes bulleted and numbered lists that consist mostly of links.

Steps to Reproduce

Run Postlight Parser on https://faultlore.com/blah/defaults-affect-inference. The bulleted list a bit after the 'Some Wild Shit Swift Does' heading gets removed.

Picture of the list in question:

Screenshot 2023-08-08 at 9 10 09 pm

Detailed Description

This is the code that causes the problem:

https://github.com/postlight/parser/blob/e8ba7ece291efa4d915d50dd4deeec17d54359f2/src/utils/dom/clean-tags.js#L43-L73

It's meant to get rid of menus and similar clusters of links that aren't real content.

Possible Solution

The easiest solution would be to take the special case from the weight >= 25 branch of the code above, which keeps any list that comes after a paragraph ending in a colon, and also apply it to the weight < 25 branch. (The lists that get removed fall into the weight < 25 camp, which is why the existing special case doesn't already save them.)
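To make the proposed change concrete, here is a rough sketch of the decision logic as a standalone function. It is not the actual Postlight Parser source: the `node` object shape, the helper names, and the link-density and length thresholds are all assumptions made up for illustration. Only the weight 25 split and the "list after a paragraph ending in a colon" special case come from the description above.

```javascript
// Hypothetical sketch of the proposed fix; NOT the real clean-tags.js code.
// These thresholds are illustrative assumptions, not values from the issue.
const LOW_WEIGHT_MAX_DENSITY = 0.2;
const HIGH_WEIGHT_MAX_DENSITY = 0.5;
const LOW_WEIGHT_MIN_LENGTH = 75;

// True when the node is a list whose previous sibling's text ends in a colon.
// `node.tagName` and `node.previousText` are made-up fields for this sketch.
function isListAfterColon(node) {
  const isList = node.tagName === 'ul' || node.tagName === 'ol';
  const previousText = (node.previousText || '').trim();
  return isList && previousText.endsWith(':');
}

// Decide whether a link-heavy node should be removed.
function shouldRemove(node) {
  const tooLinkHeavy =
    node.weight < 25
      ? node.linkDensity > LOW_WEIGHT_MAX_DENSITY &&
        node.contentLength > LOW_WEIGHT_MIN_LENGTH
      : node.linkDensity > HIGH_WEIGHT_MAX_DENSITY;

  if (!tooLinkHeavy) return false;

  // Proposed change: apply the colon special case in BOTH weight branches,
  // instead of only when weight >= 25.
  return !isListAfterColon(node);
}
```

With this shape, a low-weight `ul` that follows a paragraph like "Some examples:" is kept, while a low-weight `ul` that follows ordinary text, or a link-heavy `div`, is still removed.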

Another solution I thought of would be to look at either the average or maximum length of links in a list (or table / div / everything else that the tag-cleaning code gets applied to), and keep the element if that length is above some threshold. In theory that should differentiate between shorter links in menus and longer sentence-length links in content. But looking again at the example I provided, those links are actually quite short, so that might not work as well as I'd hoped.
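For reference, the link-length idea could look something like the sketch below. The function names, the input format (a plain array of link texts), and the default threshold of 40 characters are all hypothetical. Nothing here is taken from Postlight Parser, and no particular threshold has been tested.

```javascript
// Hypothetical sketch of the link-length heuristic; names and threshold are
// assumptions for illustration only.

// Compute the average and maximum length of the given link texts.
function linkLengthStats(linkTexts) {
  const lengths = linkTexts.map((text) => text.trim().length);
  if (lengths.length === 0) return { average: 0, max: 0 };
  const total = lengths.reduce((sum, n) => sum + n, 0);
  return { average: total / lengths.length, max: Math.max(...lengths) };
}

// Treat the links as content when the chosen statistic exceeds the threshold.
function looksLikeContentLinks(linkTexts, threshold = 40, useMax = false) {
  const stats = linkLengthStats(linkTexts);
  return (useMax ? stats.max : stats.average) > threshold;
}

// Short, menu-style links fall below the threshold.
const menuLinks = ['Home', 'About', 'Contact'];
// A sentence-length link clears it.
const contentLinks = [
  'A detailed walkthrough of how default type parameters affect inference',
];

console.log(looksLikeContentLinks(menuLinks));
console.log(looksLikeContentLinks(contentLinks));
```

As noted above, a content list with short links would still be classified as a menu by this check, which is the main weakness of the approach.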

So yeah, probably that first solution. I've already implemented it at https://github.com/Liamolucko/postlight-parser/tree/fix-link-lists and confirmed that it works.

Liamolucko, Aug 08 '23 11:08