Performance Issues with Image Fetch

Open AndyTheFactory opened this issue 2 years ago • 0 comments

Issue by owaaa Wed Dec 6 12:43:26 2017 Originally opened as https://github.com/codelucas/newspaper/issues/483

Overview

I am seeing scrapes that would otherwise take 500 ms to 1 s take 15 s to 30 s or more when fetch_images=True is part of an article parse (this is also the default behavior). When scraping many sites this also becomes a synchronous bottleneck, because the requests library is embedded in the Image processing and downloading images is not separated from assessing their fitness as the "top image". Additionally, if you set fetch_images=False, newspaper bypasses all image logic and no image_urls are returned, even though they may still have been useful.
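To illustrate the last point, here is a rough sketch of collecting "unchecked" image URLs from already-downloaded HTML without fetching any image. It uses only the standard library and is not newspaper's code; the names ImgSrcCollector, unchecked_image_urls and max_images are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect <img src> values in document order, without downloading anything."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the article URL and skip duplicates.
            absolute = urljoin(self.base_url, src)
            if absolute not in self.urls:
                self.urls.append(absolute)


def unchecked_image_urls(html, base_url, max_images=None):
    """Return image URLs found in html; max_images (hypothetical) caps the list."""
    collector = ImgSrcCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.urls[:max_images]
```

Something like this could run alongside a parse done with fetch_images=False, so the caller still gets candidate URLs and can decide separately whether to download them.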

Expected Behavior

As image downloading seems to be a primary driver of time for many websites, I think newspaper should allow greater flexibility in configuring the image extraction process. This includes the ability to cap the number of images checked, an option to return parsed image URLs "unchecked", and separating the image download process from image checking, so that external techniques or asyncio can optionally be used for downloading, similar to how article.set_html can be used to supply externally downloaded content for parsing.

Potential Resolutions

I will likely need to address this for my own needs, so I would like some feedback on potential approaches to maximize the chances of a future PR. Here are some things I am considering:

1. Modify fetch_images to only control "fetching" (leave it defaulted to True), and maybe add an additional parameter provide_unchecked_images, defaulted to True, to control whether image_urls are returned when images are not fetched.
2. Add a parameter max_images_fetched that optionally limits the number of article images considered, similar to the existing settings for article length, minimum sentences, etc. This would alleviate issues on sites with dozens of images.
3. Separate the Image.Scrape assessment and download logic. Ideally, find a way to provide images for assessment that can optionally be downloaded externally with asyncio and handed to a parse phase, similar to how set_html can be used. Unfortunately this doesn't look as straightforward yet as the first two options. In the worst case, option 1 allows a workaround where someone could download images themselves and call into newspaper's internal Image methods.
4. Long term: I wonder whether other, possibly less expensive assessment techniques have been considered (even as options), such as looking for images embedded in top_node (or very close to it).
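To make the separation of download and assessment more concrete, here is a minimal sketch of what I have in mind. It is not newspaper's API: FetchedImage, pick_top_image, fetch_all and the size thresholds are hypothetical, and fake_fetch is a stand-in for a real async downloader (an aiohttp call, for example) that would return the image dimensions. The max_images_fetched cap is applied before any download starts.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FetchedImage:
    url: str
    width: int
    height: int


def pick_top_image(images, min_width=300, min_height=200):
    """Assessment only: no network access. Thresholds are illustrative."""
    candidates = [i for i in images if i.width >= min_width and i.height >= min_height]
    if not candidates:
        return None
    # Prefer the image with the largest area among acceptable candidates.
    return max(candidates, key=lambda i: i.width * i.height).url


async def fetch_all(urls, fetch_one, max_images_fetched=None, concurrency=5):
    """Download step: fetch_one is any coroutine function supplied by the caller."""
    limit = asyncio.Semaphore(concurrency)

    async def guarded(url):
        async with limit:
            try:
                return await fetch_one(url)
            except Exception:
                return None  # a failed download just drops that candidate

    capped = urls[:max_images_fetched]
    results = await asyncio.gather(*(guarded(u) for u in capped))
    return [r for r in results if r is not None]


# Stand-in downloader with canned dimensions, so the sketch runs offline.
FAKE_SIZES = {
    "http://e.com/small.png": (50, 50),
    "http://e.com/big.jpg": (800, 600),
    "http://e.com/huge.jpg": (4000, 3000),
}


async def fake_fetch(url):
    await asyncio.sleep(0)
    width, height = FAKE_SIZES[url]  # raises KeyError for unknown URLs
    return FetchedImage(url, width, height)
```

With this split, the caller controls how images are downloaded (asyncio, a thread pool, a cache), and the "top image" decision becomes a plain function over whatever was fetched.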

Oct 24 '23 10:10 AndyTheFactory