poor top_image results (improve when dimension check on og:image added)

Open rahulbot opened this issue 3 years ago • 0 comments

We found that top_image returned a favicon for a noticeable number of stories in a sample of news we were working on. This happened even on stories from large, popular websites like BBC News. After some trial and error, I found that using og:image only when it is large improves the results. I am noting this here for anyone else who runs into it, and as a suggested change.

After investigating a bit more, I found the algorithm seems to work like this pseudocode:

    if og:image is set:
        use that as the top image, no matter what
    else if the first image is big:
        use that as the top image
    else:
        pick the image with the largest dimensions

The problem we faced is that the first clause was picking small images in many articles, because they were set as the og:image. It appears that this change was made years ago (https://github.com/codelucas/newspaper/issues/96) due to some findings from German news sources.
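To make the failure concrete, here is a minimal, self-contained sketch of the selection order described above. It is not the library's code: the `Image` class, `is_big`, `pick_top_image`, the `MIN_SIDE` threshold, and the example URLs are all hypothetical stand-ins, and the real size requirements may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold; the library's real size requirements are not given in this issue.
MIN_SIDE = 200


@dataclass
class Image:
    url: str
    width: int
    height: int


def is_big(img: Image) -> bool:
    """Stand-in for a dimension check: both sides must reach MIN_SIDE."""
    return img.width >= MIN_SIDE and img.height >= MIN_SIDE


def pick_top_image(og_image: Optional[Image], page_images: List[Image]) -> Optional[Image]:
    # Clause 1: og:image wins unconditionally, whatever its size.
    if og_image is not None:
        return og_image
    # Clause 2: otherwise take the first image on the page if it is big.
    if page_images and is_big(page_images[0]):
        return page_images[0]
    # Clause 3: otherwise fall back to the image with the largest area.
    if page_images:
        return max(page_images, key=lambda img: img.width * img.height)
    return None


favicon = Image("https://example.com/favicon.png", 32, 32)
hero = Image("https://example.com/hero.jpg", 1200, 675)

# The page declares a favicon as og:image, so clause 1 returns it
# even though a much larger image is available in the article body.
print(pick_top_image(favicon, [hero]).url)  # prints https://example.com/favicon.png
```

With no og:image set, the same inputs fall through to the later clauses and the large image is chosen instead.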

I recommend changing set_meta_img to call set_top_img instead of set_top_img_no_check, so that the satisfies_requirements dimension check also runs on the og:image. We found this improves top_image results noticeably.
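The suggested change can be sketched roughly as follows. `SketchArticle` is a simplified stand-in written for this issue, not newspaper's actual class: only the method names come from the library, while the `known_sizes` lookup, the `MIN_SIDE` threshold, and the example URLs are assumptions that replace real image fetching and the library's real requirements. Where the check lives in the real library is not modeled here.

```python
from typing import Dict, Optional, Tuple

# Hypothetical threshold; the library's real requirements may differ.
MIN_SIDE = 200


class SketchArticle:
    """Simplified stand-in for illustration only."""

    def __init__(self, known_sizes: Dict[str, Tuple[int, int]]):
        # Assumption: sizes are supplied up front instead of downloading each image.
        self.known_sizes = known_sizes
        self.meta_img = ""
        self.top_img = ""

    def satisfies_requirements(self, src_url: str) -> bool:
        # Unknown URLs are treated as 0x0 and therefore rejected.
        width, height = self.known_sizes.get(src_url, (0, 0))
        return width >= MIN_SIDE and height >= MIN_SIDE

    def set_top_img_no_check(self, src_url: str) -> None:
        self.top_img = src_url

    def set_top_img(self, src_url: Optional[str]) -> None:
        # Only accept the image if it passes the dimension check.
        if src_url is not None and self.satisfies_requirements(src_url):
            self.set_top_img_no_check(src_url)

    def set_meta_img(self, src_url: str) -> None:
        self.meta_img = src_url
        # Proposed change: call set_top_img here instead of set_top_img_no_check.
        self.set_top_img(src_url)


sizes = {
    "https://example.com/favicon.png": (32, 32),
    "https://example.com/hero.jpg": (1200, 675),
}

small = SketchArticle(sizes)
small.set_meta_img("https://example.com/favicon.png")
print(repr(small.top_img))  # prints ''

large = SketchArticle(sizes)
large.set_meta_img("https://example.com/hero.jpg")
print(repr(large.top_img))  # prints 'https://example.com/hero.jpg'
```

In this sketch a small og:image is still recorded as meta_img but no longer becomes the top image, which leaves top_img empty so that the later fallbacks can pick a better candidate.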

rahulbot • Jun 03 '21 15:06