
Skip sample if it takes too long to download

pkill37 opened this issue 5 years ago · 3 comments

Sometimes the script hangs while trying to download a specific description or image. When the user requests a specific number k of samples, it would be nice if the script skipped or retried samples that are taking too long to download.

Right now I've been waiting for a while to download 250 malignant samples because it has been stuck trying to download the 197th for a few hours.

For the record, after I gave up and killed the process, the exception revealed:

```
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='isic-archive.com', port=443): Max retries exceeded with url: /api/v1/image/54e7ddbbbae4780ec59cde5f (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x10f3b2518>: Failed to establish a new connection: [Errno 60] Operation timed out',))
```
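To illustrate what I'm asking for: a per-request timeout plus a bounded retry loop would prevent this kind of indefinite hang. This is only a sketch, not the downloader's actual code; `fetch_with_retries` is a hypothetical helper, and the URL is just the one from the traceback above.

```python
import requests

def fetch_with_retries(url, max_tries=3, timeout=30):
    """Return the response, or None if the sample should be skipped."""
    for attempt in range(1, max_tries + 1):
        try:
            # `timeout` bounds connecting and reading, so a dead connection
            # raises quickly instead of blocking forever.
            return requests.get(url, timeout=timeout)
        except requests.exceptions.RequestException as exc:
            print(f"attempt {attempt}/{max_tries} failed: {exc}")
    return None  # caller can skip this sample and move on

response = fetch_with_retries(
    "https://isic-archive.com/api/v1/image/54e7ddbbbae4780ec59cde5f")
if response is None:
    print("skipping sample after repeated failures")
```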

pkill37 · Feb 02 '19 18:02

A temporary poor man's solution could be to identify the offset i of the last successfully downloaded sample (in my case 196) and manually rerun the script starting at offset i+2 (in my case 198), thus skipping the problematic sample.

This would be fine in principle. However, it does not work: filenames do not seem to be derived from the requested offset and number of samples (I'm not sure exactly what's going on), so some samples from the rerun are saved under the same filenames as previously downloaded ones and overwrite them.
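In case it helps anyone, here is a rough sketch of what I ended up doing by hand: rerun into a separate directory, then merge without letting colliding filenames overwrite earlier samples. The directory names and the merge helper below are made up for illustration; this is not the script's actual behavior.

```python
import shutil
from pathlib import Path

def merge_without_overwrite(src: Path, dst: Path) -> None:
    """Copy files from src into dst, suffixing duplicates instead of clobbering."""
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.iterdir():
        if not f.is_file():
            continue
        target = dst / f.name
        n = 1
        while target.exists():
            # keep both copies rather than silently overwriting
            target = dst / f"{f.stem}_{n}{f.suffix}"
            n += 1
        shutil.copy2(f, target)

# e.g. after rerunning the script at offset 198 into run_offset_198/
merge_without_overwrite(Path("run_offset_198"), Path("malignant"))
```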

pkill37 · Feb 02 '19 18:02

For what it's worth, I've run the script from various locations through my VPN (i.e. with different IP addresses) and even asked a friend to run it for me. I have also tried running with --p 1, thinking that maybe all those processes were spamming the API too much. But the results are always the same: it always hangs at #160 for a while and then at #197 for much longer. So I don't think the API is rate limiting us or anything like that.

You should be able to reproduce it the same way if you run this right now:

```
python download_archive.py --num-images 250 --filter malignant
```

Do you understand why this is happening?

pkill37 · Feb 02 '19 20:02

Hey! Thank you for this great idea! :)
And regarding the issues you mention - I think some images are simply not downloadable for some reason.
If you try to download one of them via its link in your browser, that won't work either.
So I guess image #197 is one of these images. On the other hand, I can't yet think of a reason for the hang on image #160. I will try to reproduce it myself when I have the time soon.

Btw, in order to skip the problematic samples in a more elegant way, we could use the max_tries parameter that is already present in some of the download functions in download_single_item.py, and add it to the functions that don't have it.
I have put that parameter in some of the download functions with a default value of infinite tries, but never gave the user a way to specify another value for cases of problematic images such as the ones you described.

So I guess that when I have more time, that will be the way to implement this request.
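Roughly along these lines (just a sketch; the real signatures in download_single_item.py differ, and download_image here is illustrative):

```python
import math
import requests

def download_image(url, dest_path, max_tries=math.inf, timeout=30):
    """Try up to max_tries times; return False so the caller can skip the image."""
    attempt = 0
    while attempt < max_tries:  # math.inf keeps today's retry-forever behavior
        attempt += 1
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            with open(dest_path, "wb") as f:
                f.write(response.content)
            return True
        except requests.exceptions.RequestException:
            continue  # try again until max_tries is exhausted
    return False
```

With the default of infinite tries nothing changes, while a user-specified finite value makes the script give up on a problematic image and continue with the rest.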

GalAvineri · Feb 02 '19 22:02