Smart cover image discovering / Images extraction
Some RSS feeds do not include images. Implement a feature that scans the entry's URL and extracts an image from the page. Images are important for understanding the context of posts.
Hi!
Just to clarify: do you mean automatically discovering a header/cover image from the original page of the news item? I.e., "smart cover image discovering", not some specific markup in an RSS or Atom feed that contains a URL to the image.
@Tiendil you're correct. "Smart cover image discovering" describes the feature perfectly.
Some ideas for heuristics that could be used to find the image:
- check the meta tags in head. It's weird, but some sites don't put the image in the RSS entry, yet do provide it for search engines and social media. Maybe because of misconfigured RSS generators
- find the block whose text is the same as (or most similar to) the body of the RSS entry and check nearby nodes for images
- search for semantic tag names and structures. Something like taking the first selector that matches, e.g. main article img
- additional checks for image sizes (to find the image with the largest size)
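To make the ideas above more concrete, here is a rough sketch of the meta-tag check, the semantic-tag search, and the image-size check, using only the Python standard library. It is an illustration under assumptions, not a proposed implementation: the names `CoverImageParser`, `discover_cover_image`, and `META_KEYS` are hypothetical, the choice of `og:image` and `twitter:image` as the meta keys to look at is my assumption, any `img` inside `main` or `article` is accepted (looser than a strict `main article img` selector), and image size is taken only from the declared `width`/`height` attributes. The text-similarity heuristic is not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Meta keys commonly used for social previews (assumed priority order).
META_KEYS = ("og:image", "twitter:image")


def _area(attrs):
    """Return width * height from img attributes, or 0 if missing or not numeric."""
    try:
        return int(attrs.get("width", 0)) * int(attrs.get("height", 0))
    except (TypeError, ValueError):
        return 0


class CoverImageParser(HTMLParser):
    """Hypothetical helper: collects candidate images from meta tags and main/article blocks."""

    def __init__(self):
        super().__init__()
        self.meta_images = {}     # meta key -> image URL (first occurrence wins)
        self.content_images = []  # (declared area in pixels, src) found inside main/article
        self._depth = 0           # nesting depth inside <main>/<article>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key in META_KEYS and attrs.get("content"):
                self.meta_images.setdefault(key, attrs["content"])
        elif tag in ("main", "article"):
            self._depth += 1
        elif tag == "img" and self._depth > 0 and attrs.get("src"):
            self.content_images.append((_area(attrs), attrs["src"]))

    def handle_endtag(self, tag):
        if tag in ("main", "article") and self._depth > 0:
            self._depth -= 1


def discover_cover_image(html, page_url):
    """Return an absolute image URL for the page, or None if nothing was found."""
    parser = CoverImageParser()
    parser.feed(html)

    # Heuristic 1: images declared for search engines and social media.
    for key in META_KEYS:
        if key in parser.meta_images:
            return urljoin(page_url, parser.meta_images[key])

    # Heuristics 2 and 3: images in semantic containers, largest declared area first.
    # max() keeps the first of equally sized images, so ties follow document order.
    if parser.content_images:
        _, src = max(parser.content_images, key=lambda item: item[0])
        return urljoin(page_url, src)

    return None
```

Fetching the page is left out on purpose; `discover_cover_image` takes the already downloaded HTML plus the page URL, which is only used to resolve relative image paths. Images without declared dimensions get an area of 0, so a real implementation would probably need to download candidates to measure them.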
Good idea, worth implementing.
Currently, I cannot provide an estimated timeline, but I plan to prepare a roadmap of significant features for the project, and this one will be added as one of the subfeatures.
This task is related to gh-357 and gh-351