readabilitySAX icon indicating copy to clipboard operation
readabilitySAX copied to clipboard

Handling double images

Open mrjjwright opened this issue 12 years ago • 5 comments

Many sites have an enlarge javascript function to see a larger image. E.g. NPR often does this: http://www.npr.org/blogs/therecord/2012/08/22/159534467/my-american-dream-sounds-like-black-star?sc=fb&cc=fmp.

Safari Reader and Readability manage to filter out the double images somehow but readabilitySAX is showing the same image twice at the top of the article. I am trying to figure out how to fix this but thought I would post an issue until I figure it out.

mrjjwright avatar Aug 23 '12 16:08 mrjjwright

That's the nice thing when you've got a rendered page: You may simply remove everything that isn't visible (for jQuery users: $(":not(:visible)").remove()).

I can't tell by looking at the markup which parts will be hidden, so I'm closing this as I don't see a way to fix it.

fb55 avatar Aug 23 '12 16:08 fb55

Well what about this in onclosetag? The idea here would be to only allow one image per level in the readable article. I tested this and it works well on NPR articles except it doesn't remove duplicate captions.

if (tagName === 'img') {
  // don't double up on showing  images
  if (this._imgElem) {
    elem.parent.isCandidate = false;
    // skip the img
    return;
}
  else this._imgElem = elem.parent;   
}

if (this._imgElem === elem) {
  this._imgElem = null;
}

mrjjwright avatar Aug 23 '12 17:08 mrjjwright

This would solve this particular problem, but would raise others: Just think about image galeries. And I've seen websites that wrapped their content in <pre> tags, with image tags at random locations. I don't want to break these, even though their markup is terrible. But you may argue the same for npr.org.

Besides, isCandidate only signals that an element is a candidate for being the container of the article ;)

fb55 avatar Aug 23 '12 17:08 fb55

Okay, first, you should use a label for this code. The boolean isn't needed and complicates everything. Besides, you remove all nodes that aren't <img>s, which isn't such a good idea.

I guess the best variant would be to filter images based on their URLs, in a similar way as links are already handled. When you find some usable rules for comparing URLs, that would be great.

fb55 avatar Aug 24 '12 11:08 fb55

I got rid of that code.

I instead keep track of the largest image within each div and if the div is too low on contentLength simply use that image as the div's children. This is taking care of enlarge divs and the like. I filter out anything with slideshow explicitly. It's working well.

On Aug 24, 2012, at 5:32 AM, Felix Böhm [email protected] wrote:

Okay, first, you should use a label for this code. The boolean isn't needed and complicates everything. Besides, you remove all nodes that aren't s, which isn't such a good idea.

I guess the best variant would be to filter images based on their URLs, in a similar way as links are already handled. When you find some usable rules for comparing URLs, that would be great.

— Reply to this email directly or view it on GitHub.

mrjjwright avatar Aug 24 '12 19:08 mrjjwright