img2dataset
Implement Exponential Backoff
I am currently using this tool on a website that uses AWS CloudFront to host all of its images. However, if you make too many requests to its URLs, you receive a 429 error that does not provide any Retry-After header. This means that img2dataset just keeps hammering CloudFront, causing the ban to be extended. The only way to still allow images to be downloaded is to apply exponential backoff. This is also just good practice when scraping websites, so it would be great if it could be included.
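For concreteness, something along these lines is what I have in mind; a rough sketch using plain `requests` rather than img2dataset's actual downloader, with the retry count and delays picked arbitrarily:

```python
import time

import requests


def fetch_with_backoff(url, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a download with exponential backoff when the host returns 429."""
    delay = base_delay
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            response.raise_for_status()
            return response.content
        # No Retry-After header is provided, so back off exponentially.
        time.sleep(delay)
        delay = min(delay * 2, max_delay)
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```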
I'm afraid that will not work if you are scraping a single host. You will need to scrape slowly.
Yeah, unfortunately it is unclear what the limit is. The docs say we should implement exponential backoff, which is probably a good policy uniformly.
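For what it's worth, the AWS guidance generally suggests adding jitter to the backoff as well; the delay could be computed roughly like this (the constants are just illustrative):

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=60.0):
    """Full-jitter backoff: a random delay up to a capped exponential bound."""
    return random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
```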
It does not make sense to implement exponential backoff while maximizing the number of parallel calls. Simply reduce the number of parallel calls.
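For a single host, something like this (the Python API from the README, with `urls.txt` standing in for your url list) should be enough:

```python
from img2dataset import download

# Keep parallelism minimal so a single host is not hammered.
download(
    url_list="urls.txt",      # placeholder: one url per line
    output_folder="images",
    processes_count=1,
    thread_count=1,
)
```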
@rom1504 Depending on the settings of the host, though, you could hit the rate limit even with non-parallel calls in an extreme case.
Yes, but in that case maybe just do a loop with wget and a sleep; it'll be simpler and not slower.
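That is, the Python equivalent of a wget-and-sleep loop, roughly like this (the input file, output paths, and one-second pause are just placeholders):

```python
import time
from pathlib import Path

import requests

Path("images").mkdir(exist_ok=True)

# Hypothetical input: one url per line.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    data = requests.get(url, timeout=10).content
    Path(f"images/{i:06d}.jpg").write_bytes(data)
    time.sleep(1)  # fixed pause between requests
```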
It probably means they don't want you to crawl it~
Not necessarily; they just don't want you crawling it that fast. Some websites even send a Retry-After header to tell you to slow down and try again after X seconds.
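When that header is present it is worth honouring it before falling back to a fixed or exponential delay; a small sketch (again assuming a `requests` response object, with the fallback delay chosen arbitrarily):

```python
import time


def polite_delay(response, fallback=5.0):
    """Sleep for the server's Retry-After value if it sent one, else a fallback delay."""
    retry_after = response.headers.get("Retry-After")
    try:
        time.sleep(float(retry_after))
    except (TypeError, ValueError):
        # Header missing, or given as an HTTP date rather than seconds.
        time.sleep(fallback)
```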