readabilitySAX
Article images need better detection
There are cases where Safari Reader does a better job of keeping article images that readabilitySAX filters out. Here is an example: http://hommemaker.com/2012/08/20/why-the-gays-hate-their-bodies/. Compare the Safari Reader rendering with readabilitySAX. In this case readabilitySAX should preserve images that are wrapped inside an `a` parent and a `p` grandparent tag. The general rule might be that a single image of sufficient size, with any number of wrapping tags, is a candidate. There is probably a better general rule; that is just my take on it.
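A rough sketch of that rule, to make the proposal concrete. It is not readabilitySAX's actual code: the node shape (`name`, `attribs`, `children`), the function names, and the size thresholds are all assumptions chosen for illustration.

```javascript
// Hypothetical thresholds; the right values would need experimentation.
const MIN_WIDTH = 200;
const MIN_HEIGHT = 150;

// Follow a chain of single-child wrappers (p > a > img, for example)
// down to the image. Returns null if any wrapper has siblings inside it.
function findWrappedImage(node) {
  let current = node;
  while (current && current.name !== 'img') {
    const children = current.children || [];
    if (children.length !== 1) return null;
    current = children[0];
  }
  return current || null;
}

// An image is a candidate if it is the only content of its wrappers
// and its declared size is large enough.
function isCandidateImage(node) {
  const img = findWrappedImage(node);
  if (!img) return false;
  const attribs = img.attribs || {};
  const width = parseInt(attribs.width, 10);
  const height = parseInt(attribs.height, 10);
  // Without declared dimensions this sketch cannot decide, so it says no.
  if (Number.isNaN(width) || Number.isNaN(height)) return false;
  return width >= MIN_WIDTH && height >= MIN_HEIGHT;
}
```

Note that this only looks at declared `width` and `height` attributes, so images without them are rejected here.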
The problem is that banner ads are often big images inside an `a` tag. I was really annoyed by the number of banner ads I got, so I added this rule. In retrospect, it looks a bit harsh.
In terms of results, the ideal solution would be to use a list of ads and check every image against its rules (similar to Adblock Plus). But this would probably hurt performance badly, and the list would need to be updated quite often.
Another option would be to filter images based on their aspect ratio. But not all images have their width & height specified, which complicates this.
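A minimal sketch of what an aspect-ratio check could look like, including the awkward "unknown" case. The function name, the `attribs` object shape, and the ratio limit are assumptions for illustration, not anything readabilitySAX currently does.

```javascript
// Hypothetical limit: anything more than 3 times wider than tall
// (or taller than wide) is treated as a probable banner.
const MAX_ASPECT_RATIO = 3;

// Classify an image from its declared attributes.
// Returns 'banner', 'content', or 'unknown' when a dimension is missing.
function classifyByAspectRatio(attribs) {
  const width = parseInt(attribs.width, 10);
  const height = parseInt(attribs.height, 10);
  // NaN, zero, and negative values all fail these comparisons.
  if (!(width > 0) || !(height > 0)) return 'unknown';
  const ratio = Math.max(width / height, height / width);
  return ratio > MAX_ASPECT_RATIO ? 'banner' : 'content';
}
```

For example, a 728x90 leaderboard or a 160x600 skyscraper would come out as `'banner'`, a 600x400 photo as `'content'`, and an image without dimensions as `'unknown'`, which is exactly the case that still needs a policy.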
I guess we'll have to live with either banner ads or missing images. Missing images might hamper the understanding of an article, while banner ads can be ignored. The choice seems to be pretty obvious, so I'll change the behavior of readabilitySAX soon.
The easiest way is to make it optional and to provide optional middleware for image size detection. I can share my private code that does this as an example. Reading just a few bytes from each image makes it relatively fast.
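To illustrate why only a few bytes are needed, here is a sketch (not the private code mentioned above) that reads dimensions from the start of a PNG or GIF file. The function name `readImageSize` is made up for this example, and JPEG is left out because its dimensions are not at a fixed offset.

```javascript
// Read width and height from the first bytes of an image.
// Needs 24 bytes for PNG and 10 bytes for GIF.
// Returns null for unrecognized or truncated data.
function readImageSize(buf) {
  // PNG: 8-byte signature, then the IHDR chunk (4-byte length, "IHDR"),
  // then width and height as big-endian 32-bit integers.
  if (
    buf.length >= 24 &&
    buf.readUInt32BE(0) === 0x89504e47 &&
    buf.readUInt32BE(4) === 0x0d0a1a0a &&
    buf.toString('ascii', 12, 16) === 'IHDR'
  ) {
    return {
      type: 'png',
      width: buf.readUInt32BE(16),
      height: buf.readUInt32BE(20),
    };
  }

  // GIF: "GIF87a" or "GIF89a", then width and height as
  // little-endian 16-bit integers.
  if (buf.length >= 10) {
    const signature = buf.toString('ascii', 0, 6);
    if (signature === 'GIF87a' || signature === 'GIF89a') {
      return {
        type: 'gif',
        width: buf.readUInt16LE(6),
        height: buf.readUInt16LE(8),
      };
    }
  }

  return null;
}
```

A middleware could request only the first chunk of each image, pass it to a function like this, and abort the download as soon as the size is known.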