newspaper
poor top_image results (improve when dimension check on og:image added)
We found that top_image returned favicons for a noticeable number of stories in a sample of news we were working on. This happened on stories from large, popular websites like BBC News. After some trial and error I found that changing the logic to use og:image only if it is large improves the results. I'm noting this here for anyone else who runs into this, and as a suggested change.
After investigating a bit more, I found the algorithm seems to work like this pseudocode:
"
If og:image is set: use that as the top image no matter what
else if the first image is big: use that as the top image
else: pick the image with the largest dimensions
"
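To make the failure mode concrete, here is a minimal Python sketch of the selection order described above. It is not the library's code: the `Image` type, `MIN_SIDE`, `is_big` and `pick_top_image` are hypothetical names made up for illustration, and the 200 px threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Image:
    """Hypothetical image record: URL plus pixel dimensions."""
    url: str
    width: int
    height: int


# Assumed threshold for "big"; the library's real requirements may differ.
MIN_SIDE = 200


def is_big(img: Image) -> bool:
    """True when both sides meet the assumed minimum size."""
    return img.width >= MIN_SIDE and img.height >= MIN_SIDE


def pick_top_image(og_image: Optional[Image],
                   images: Sequence[Image]) -> Optional[Image]:
    """Mirror the three clauses of the pseudocode, in order."""
    # Clause 1: og:image wins unconditionally, with no size check.
    if og_image is not None:
        return og_image
    if not images:
        return None
    # Clause 2: the first image on the page, if it is big.
    if is_big(images[0]):
        return images[0]
    # Clause 3: otherwise the image with the largest area.
    return max(images, key=lambda img: img.width * img.height)


favicon = Image("https://example.com/favicon.ico", 16, 16)
photo = Image("https://example.com/photo.jpg", 1200, 800)

# A tiny og:image beats a large in-article photo.
chosen = pick_top_image(favicon, [photo])
print(chosen.url)
```

With a 16x16 og:image and a 1200x800 photo in the article body, the first clause returns the favicon and the later clauses never run.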
The problem we faced is that the first clause was picking small images in many articles, because they were set as the og:image. It appears that this behavior was introduced years ago (https://github.com/codelucas/newspaper/issues/96) due to some findings from German news sources.
I recommend changing the set_meta_img call from set_top_img_no_check to set_top_img (so it runs the satisfies_requirements dimensional check). We found this improves top_image results noticeably.
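As a rough illustration of the suggested change, here is a toy Python model of the call chain. The method names come from this issue, but the bodies are my own guesses and not the library's actual implementation. `ArticleSketch`, the `sizes` lookup, the `min_side` threshold and the `check` flag are all hypothetical, added only so the two behaviors can be compared side by side.

```python
class ArticleSketch:
    """Toy stand-in for the article object; not the library's real class."""

    def __init__(self, sizes, min_side=200):
        # Hypothetical lookup of url -> (width, height), replacing a real fetch.
        self.sizes = sizes
        # Assumed minimum side length; the real requirements may differ.
        self.min_side = min_side
        self.meta_img = ""
        self.top_image = ""

    def satisfies_requirements(self, url):
        """Assumed dimension check: both sides must reach min_side."""
        width, height = self.sizes.get(url, (0, 0))
        return width >= self.min_side and height >= self.min_side

    def set_top_img_no_check(self, url):
        """Accept the URL as the top image unconditionally."""
        self.top_image = url

    def set_top_img(self, url):
        """Accept the URL only if it passes the dimension check."""
        if url and self.satisfies_requirements(url):
            self.set_top_img_no_check(url)

    def set_meta_img(self, url, check=True):
        """check=False models the current call, check=True the proposed one."""
        self.meta_img = url
        if check:
            self.set_top_img(url)
        else:
            self.set_top_img_no_check(url)


sizes = {"favicon.ico": (16, 16), "photo.jpg": (1200, 800)}

current = ArticleSketch(sizes)
current.set_meta_img("favicon.ico", check=False)  # favicon becomes top_image

proposed = ArticleSketch(sizes)
proposed.set_meta_img("favicon.ico")  # rejected, top_image stays empty
```

In this model the rejected og:image leaves top_image unset, so the later fallback clauses (first big image, then largest image) still get a chance to pick something better.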