
Progress bar for large downloads

Open nemec opened this issue 6 years ago • 6 comments

What are your thoughts on adding a progress bar to the scrapy HTTP handler? I recently wrote a crawler that scraped a site and threw any files it found into a FilesPipeline for download. Some of these files were 100+ MB, which made the terminal seem to "freeze" while they downloaded in the background. I know scrapy isn't really designed to be an efficient file downloader like aria2 or JDownloader, but it's a handy tool and I was already using it to scrape the file list.

I wrote a proof of concept using the Python library tqdm and it went even better than expected: tqdm automatically handles multiple progress bars at a time (one per download in scrapy's queue), so I got a clean section at the bottom of the console showing individual progress for each pending file over 5 MB in size.

Since I leaned so heavily on tqdm, the change to the scrapy source was only ~15 lines of code (the POC patch is at the bottom of this post). If this feature is worth including, I'd expect further changes too, since I'm sure you don't want scrapy to take a hard dependency on tqdm, and the progress bar should have some configuration options.

(Screenshot, 2019-05-24: stacked tqdm progress bars for several in-flight downloads)

Considerations

  • Disable the progress bar in non-interactive mode (does scrapy have such a mode? how does Scrapinghub behave?)
  • Optional dependency on tqdm (or code the feature from scratch within scrapy? - this may be a lot of work)
  • Configurable minimum size threshold for triggering the progress bar.
    • If tqdm is allowed as an optional dependency, the http11 handler should log a warning if devs set the minimum threshold but do not have tqdm installed
  • What to do when txresponse.length is UNKNOWN_LENGTH? This can happen if the server does not return a Content-Length header. Should it be disabled entirely? Or monitor _bytes_received and lazily create a progress bar if it crosses the threshold?
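The "lazily create a progress bar" option from the last bullet can be sketched in plain Python. The class and method names below are illustrative, not Scrapy API, and a small counting stand-in replaces tqdm so the sketch is self-contained:

```python
THRESHOLD = 5 * 1024 * 1024  # 5 MB, matching the patch below


class _CountingBar:
    """Minimal stand-in for a tqdm bar so the sketch runs without tqdm."""

    def __init__(self, total=None):
        self.total = total  # None when Content-Length is unknown
        self.n = 0

    def update(self, nbytes):
        self.n += nbytes

    def close(self):
        pass


class LazyProgress:
    """Create the bar up front when the length is known and large enough;
    otherwise create it only once received bytes cross the threshold."""

    def __init__(self, content_length=None):
        self.bytes_received = 0
        self.bar = None
        if content_length is not None and content_length > THRESHOLD:
            self.bar = _CountingBar(total=content_length)

    def data_received(self, chunk):
        self.bytes_received += len(chunk)
        if self.bar is None and self.bytes_received > THRESHOLD:
            # Unknown total: start an indeterminate bar, backfilled with
            # everything received before the bar existed.
            self.bar = _CountingBar(total=None)
            self.bar.update(self.bytes_received - len(chunk))
        if self.bar is not None:
            self.bar.update(len(chunk))
```

With this shape, a response lacking a Content-Length header still gets a (total-less) bar once it proves to be large, rather than being skipped entirely.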

Patch

My POC is against scrapy 1.5.0, but the source for http11 in master looks unchanged except for one added line disabling lazy, so the patch line numbers are mostly off by one.

--- ~/scrapy-1.5.0/scrapy/core/downloader/handlers/http11.py
+++ ~/.local/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py
@@ -28,6 +28,9 @@
 from scrapy.utils.misc import load_object
 from scrapy.utils.python import to_bytes, to_unicode
 from scrapy import twisted_version
+
+from tqdm import tqdm
+
 
 logger = logging.getLogger(__name__)
 
@@ -432,6 +435,15 @@
         self._reached_warnsize = False
         self._bytes_received = 0
 
+        self.progress = None
+        try:
+            length = int(txresponse.length)
+            # show progress if > 5MB
+            if length > 5242880:
+                self.progress = tqdm(total=length, unit='B', unit_scale=True)
+        except (TypeError, ValueError):
+            pass
+
     def dataReceived(self, bodyBytes):
         # This maybe called several times after cancel was called with buffered
         # data.
@@ -439,7 +451,10 @@
             return
 
         self._bodybuf.write(bodyBytes)
-        self._bytes_received += len(bodyBytes)
+        new_bytes = len(bodyBytes)
+        self._bytes_received += new_bytes
+        if self.progress is not None:
+            self.progress.update(new_bytes)
 
         if self._maxsize and self._bytes_received > self._maxsize:
             logger.error("Received (%(bytes)s) bytes larger than download "
@@ -460,6 +475,9 @@
                             'request': self._request})
 
     def connectionLost(self, reason):
+        if self.progress is not None:
+            self.progress.close()
+
         if self._finished.called:
             return
 

nemec avatar May 24 '19 21:05 nemec

Maybe this could be refactored as an optional extension (disabled by default) which allows some customization through variables (e.g. the 5MB threshold).

Gallaecio avatar May 27 '19 08:05 Gallaecio

That makes a lot of sense. I'll take a closer look at the Extensions API.

From a quick scan, it seems I might be able to listen for the response_received signal to read the Content-Length header, but there doesn't appear to be a hook for monitoring _bytes_received.

nemec avatar May 27 '19 21:05 nemec

If there’s none, it may be time to add it :)

Gallaecio avatar May 28 '19 08:05 Gallaecio

@nemec #4205 was merged, which means now there is a way to monitor the progress of a given request. Are you still interested in adding a progress bar extension?

elacuesta avatar May 11 '20 14:05 elacuesta
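A rough sketch of what such an extension could look like on top of the bytes_received signal merged in #4205 (available since Scrapy 2.2). The class, the min_size default, and the fallback bar are illustrative assumptions; in a real extension, from_crawler would connect these handlers to scrapy.signals.bytes_received and scrapy.signals.response_received via crawler.signals.connect:

```python
try:
    from tqdm import tqdm as _Bar
except ImportError:
    class _Bar:
        """Silent fallback used when tqdm is not installed."""

        def __init__(self, total=None):
            self.total, self.n = total, 0

        def update(self, nbytes):
            self.n += nbytes

        def close(self):
            pass


class DownloadProgress:
    """One progress bar per in-flight request once it exceeds min_size bytes."""

    def __init__(self, min_size=5 * 1024 * 1024):
        self.min_size = min_size
        self.bars = {}      # request -> bar
        self.received = {}  # request -> bytes seen so far

    # Would be connected to scrapy.signals.bytes_received(data, request, spider).
    def on_bytes_received(self, data, request, spider=None):
        seen = self.received.get(request, 0) + len(data)
        self.received[request] = seen
        bar = self.bars.get(request)
        if bar is None and seen > self.min_size:
            bar = self.bars[request] = _Bar(total=None)
            bar.update(seen - len(data))  # backfill bytes seen before the bar existed
        if bar is not None:
            bar.update(len(data))

    # Would be connected to scrapy.signals.response_received to clean up.
    def on_response_received(self, response=None, request=None, spider=None):
        bar = self.bars.pop(request, None)
        if bar is not None:
            bar.close()
        self.received.pop(request, None)
```

Keeping the bookkeeping keyed by request means multiple concurrent downloads each get their own bar, which is the behavior the original POC got from tqdm for free.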

A workaround is to redirect all stdout to a log file using these settings:

LOG_STDOUT = True
LOG_FILE = 'my_spider.log_file'

and use tqdm around your loops for tracking in the terminal.
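For example (illustrative; the URL list stands in for whatever iterable the spider or a post-processing step loops over), with scrapy's log output going to the file, tqdm can own the terminal. A passthrough fallback keeps the snippet working even without tqdm installed:

```python
try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):  # plain passthrough when tqdm is missing
        return iterable

file_urls = [f"https://example.com/file{i}.zip" for i in range(3)]
# The bar renders on the terminal; the log file only gets scrapy's output.
names = [url.rsplit("/", 1)[-1] for url in tqdm(file_urls, unit="url")]
```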

gndps avatar May 30 '20 14:05 gndps

@elacuesta just wanted to pop in and say your linked pull request works great for me. Thanks to the scrapy devs for adding the signal hooks needed :)

nemec avatar May 02 '22 07:05 nemec