Performance Issues with Image Fetch
Issue by owaaa
Wed Dec 6 12:43:26 2017
Originally opened as https://github.com/codelucas/newspaper/issues/483
Overview
I am seeing scrapes that would otherwise take 500 ms to 1 s take 15 s to 30 s or more when fetch_images=True is included in an article parse (this is also the default behavior). When scraping many sites this also becomes a synchronous bottleneck, because the requests library is embedded in the image processing and downloading images is not separated from assessing their fitness as the "top image". Additionally, if you set fetch_images=False, newspaper bypasses all image logic and returns no image_urls, even though those URLs might still have been useful.
Expected Behavior
Since image downloading seems to be the primary driver of time for many websites, I think newspaper should allow greater flexibility in configuring the image extraction process. This includes the ability to cap the number of images checked, an option to return parsed image URLs "unchecked", and a separation of image downloading from image checking, so that external techniques or asyncio can optionally be used for downloading, similar to how article.set_html can be used to supply externally downloaded content for parsing.
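To illustrate the kind of external download step this would enable, here is a minimal sketch using asyncio with bounded concurrency. Everything in it is an assumption for illustration: `download_images`, the `fetch` coroutine, and `max_concurrency` are hypothetical names, not part of newspaper, and the caller would supply `fetch` using whatever async HTTP client they prefer.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional

# Hypothetical signature: any coroutine that maps a URL to raw image bytes.
Fetcher = Callable[[str], Awaitable[bytes]]


async def download_images(
    urls: Iterable[str],
    fetch: Fetcher,
    max_concurrency: int = 8,
) -> Dict[str, Optional[bytes]]:
    """Download all URLs concurrently; failed downloads map to None."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Optional[bytes]:
        # The semaphore caps how many downloads are in flight at once.
        async with semaphore:
            try:
                return await fetch(url)
            except Exception:
                # One bad image should not abort the whole batch.
                return None

    unique_urls = list(dict.fromkeys(urls))  # de-duplicate, keep order
    bodies = await asyncio.gather(*(fetch_one(u) for u in unique_urls))
    return dict(zip(unique_urls, bodies))
```

The resulting mapping of URL to bytes could then be handed to an assessment step, in the same spirit as passing pre-downloaded HTML to article.set_html.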
Potential Resolutions
I am likely going to need to address this for my own needs, so I would like some feedback on potential approaches to maximize the chances of a future PR. Here are some things I am considering:
- Modify fetch_images to only control "fetching" (leave it defaulted to True), and maybe add an additional parameter provide_unchecked_images, defaulted to True, to control whether image_urls are returned when images are not fetched.
- Add a parameter max_images_fetched that optionally limits the number of article images considered, similar to the existing settings for article length, minimum sentences, etc. This would alleviate issues on sites with dozens of images.
- Separate the Image.Scrape assessment and download logic. Ideally, find a way to provide images for assessment that can optionally be downloaded externally with asyncio and passed to a parse phase, similar to how set_html can be used. Unfortunately, this doesn't look as straightforward yet as the first two options. In the worst case, option 1 allows a workaround where someone could download images and call into newspaper's internal Image methods.
- Long term: I wonder whether other, possibly less expensive, assessment techniques have been considered (even as options), such as looking for images embedded in top_node (or very close to it).
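As a rough sketch of how the first three options could fit together, the function below caps the number of candidates, always returns the unchecked URL list, and takes image dimensions from an injected callable so that the download can happen elsewhere. All names here (`select_top_image`, `dimensions_for`, `min_area`) are hypothetical, and picking the largest area is only a stand-in for whatever scoring newspaper actually applies.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical: returns (width, height) for a URL, or None if unavailable.
DimensionLookup = Callable[[str], Optional[Tuple[int, int]]]


def select_top_image(
    image_urls: List[str],
    dimensions_for: DimensionLookup,
    max_images_fetched: Optional[int] = None,
    min_area: int = 5000,
) -> Tuple[Optional[str], List[str]]:
    """Return (top_image_url, unchecked_image_urls)."""
    # Only the first max_images_fetched candidates are ever looked up.
    if max_images_fetched is None:
        candidates = image_urls
    else:
        candidates = image_urls[:max_images_fetched]

    best_url, best_area = None, 0
    for url in candidates:
        dims = dimensions_for(url)
        if dims is None:
            continue  # download failed or image was not provided
        area = dims[0] * dims[1]
        # Stand-in heuristic: largest image above a minimum area wins.
        if area >= min_area and area > best_area:
            best_url, best_area = url, area

    # The full URL list is returned regardless of what was checked.
    return best_url, list(image_urls)
```

Because `dimensions_for` is just a callable, it could read from images downloaded ahead of time (for example with asyncio) instead of triggering a synchronous request per image.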