RSS scrape image from article if otherwise none is found

Open rakicjovan opened this issue 7 months ago • 4 comments

Some RSS feeds don't provide a media tag, and the other fallbacks used to find an image to display sometimes fail as well. One example is the Bleeping Computer RSS feed: https://www.bleepingcomputer.com/feed/

This PR implements another fallback, which uses the worker pool to scrape an image from the article itself. It uses the CSS selectors article img, main img, and .post-content img to look for images typically found in article content. The first element matched by these selectors that has a valid src attribute is used as the preview image.

Tested using Docker with the Bleeping Computer feed, where no pictures are displayed otherwise. Loading times don't seem to be affected, thanks to the worker pool. If you try to replicate this with Bleeping Computer, remember to change the User-Agent, otherwise you will be blocked by Cloudflare.

Before: image

After: image

rakicjovan avatar May 01 '25 21:05 rakicjovan

@rakicjovan Can you tell me how to fix this? What config do I need to set to generate images?

dhanadhan avatar May 18 '25 05:05 dhanadhan

@dhanadhan You'll have to clone my fork of the repo and build the docker image or the binary. My config for the BleepingComputer RSS looks like this:

- type: rss
  style: detailed-list
  limit: 15
  collapse-after: 3
  feeds:
    - url: https://www.bleepingcomputer.com/feed/
      headers:
        User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3
      title: Bleeping Computer

I found that if I don't change the User-Agent, glance gets blocked by Cloudflare. You may also have to limit it to 10 items from BleepingComputer to avoid being rate limited.

rakicjovan avatar May 18 '25 12:05 rakicjovan

While this is potentially a big improvement in some cases, personally I feel like if a feed doesn't provide proper thumbnails then just let it be. The RSS widget is already one of the slowest things to load and this would exacerbate the issue further. There was another PR which adds fallback thumbnails, and that wouldn't be as good as this, but it's much simpler and less costly to add.

svilenmarkov avatar May 19 '25 20:05 svilenmarkov

@svilenmarkov I've been daily driving it and haven't experienced any issues regarding loading times, thanks to the efficiency of Go and the worker pool. As you said, it could potentially be a big improvement for people using feeds which don't provide the image directly in the feed. How about an option in the YAML config like "scrape-image" with a default value of false, so the user would need to explicitly set the option if needed?

rakicjovan avatar May 19 '25 22:05 rakicjovan

@svilenmarkov RSS image scraping is now disabled by default. It must be explicitly enabled per RSS feed via the config file. The configuration documentation has been updated accordingly.

rakicjovan avatar Jun 06 '25 17:06 rakicjovan