readabilitySAX

Article images need better detection

Open mrjjwright opened this issue 12 years ago • 2 comments

In some cases Safari Reader does a better job of keeping article images that readabilitySAX filters out. Here is an example: http://hommemaker.com/2012/08/20/why-the-gays-hate-their-bodies/. Compare the Safari Reader rendering with readabilitySAX. In this case readabilitySAX should preserve images that are wrapped inside an <a> parent and a <p> grandparent tag. The general rule might be that a single image of sufficient size is a candidate, no matter how many tags wrap it. There is probably a better general rule; that is just my take on it.

mrjjwright, Aug 23 '12 14:08

The problem is that banner ads are often big images inside an <a> tag. I was really annoyed by the number of banner ads I got, so I added this rule. In retrospect, it looks a bit harsh.

The ideal solution in terms of the result would be to use a list of ad rules and check every image against it (similar to Adblock Plus). But this would probably hurt performance badly, and the list would need to be updated quite often.

Another option would be to filter images based on their aspect ratio. But not all images have their width & height specified, which complicates this.
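To illustrate the aspect-ratio idea, here is a minimal sketch. It is not code from readabilitySAX: the function name `classifyByAspectRatio`, the `ImageAttrs` type, and the threshold of 4 are all assumptions chosen for the example. It returns "unknown" when width or height is missing or not a plain pixel value, which is exactly the case that complicates this approach.

```typescript
// Hypothetical helper, not part of readabilitySAX.
type ImageAttrs = { width?: string; height?: string };
type Verdict = "keep" | "drop" | "unknown";

// Accept only plain pixel values such as "728" or "728px".
// Percentages and other units cannot be turned into a ratio.
function parsePixels(value: string | undefined): number | null {
  if (value === undefined) return null;
  const match = /^\s*(\d+)(?:px)?\s*$/.exec(value);
  if (match === null) return null;
  const n = parseInt(match[1], 10);
  return n > 0 ? n : null;
}

// maxRatio is an assumed threshold: images whose longer side is more
// than maxRatio times the shorter side are treated as likely banners.
function classifyByAspectRatio(attrs: ImageAttrs, maxRatio: number = 4): Verdict {
  const w = parsePixels(attrs.width);
  const h = parsePixels(attrs.height);
  if (w === null || h === null) return "unknown";
  const ratio = Math.max(w / h, h / w);
  return ratio > maxRatio ? "drop" : "keep";
}
```

With these assumptions, a 728x90 leaderboard banner is dropped, a 640x480 photo is kept, and an image with only a width attribute stays undecided, so the caller still needs a fallback policy for that case.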

I guess we'll have to live with either banner ads or missing images. Missing images might hamper the understanding of an article, while banner ads can be ignored. The choice seems to be pretty obvious, so I'll change the behavior of readabilitySAX soon.

fb55, Aug 23 '12 17:08

The easiest way is to make it optional and to provide optional middleware for image size detection. I can share my private code that does this as an example. Reading just a few bytes from each image makes it relatively fast.
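The private code mentioned above is not shown in this thread, so the following is only a sketch of the general idea for one format. It assumes the first 24 bytes of the file are already available (how they are fetched is left out) and handles PNG only; the function name `readPngSize` is made up for the example. In a PNG file the 8-byte signature is followed by the IHDR chunk, which stores width and height as big-endian 32-bit integers at byte offsets 16 and 20.

```typescript
// Hypothetical helper, not part of readabilitySAX. PNG only.
const PNG_SIGNATURE = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
const IHDR_TYPE = [0x49, 0x48, 0x44, 0x52]; // ASCII "IHDR"

function readPngSize(head: Uint8Array): { width: number; height: number } | null {
  // Signature (8) + chunk length (4) + chunk type (4) + width (4) + height (4).
  if (head.length < 24) return null;
  for (let i = 0; i < PNG_SIGNATURE.length; i++) {
    if (head[i] !== PNG_SIGNATURE[i]) return null;
  }
  // The first chunk must be IHDR; its type sits at bytes 12..15.
  for (let i = 0; i < IHDR_TYPE.length; i++) {
    if (head[12 + i] !== IHDR_TYPE[i]) return null;
  }
  // DataView reads big-endian by default, which matches the PNG format.
  const view = new DataView(head.buffer, head.byteOffset, head.byteLength);
  return { width: view.getUint32(16), height: view.getUint32(20) };
}
```

Other formats (JPEG, GIF) keep their dimensions at different offsets and would need their own parsers; the returned size could then feed a size or aspect-ratio check.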

kof, Jul 17 '14 22:07