
Implement Exponential Backoff

Open Skylion007 opened this issue 1 year ago • 7 comments

I am currently using this tool on a website that uses AWS CloudFront to host all of its images. If you send too many requests to their URLs, you receive a 429 error that does not include a Retry-After header. This means img2dataset just keeps hammering CloudFront, which extends the ban. The only way to keep downloading images is to apply exponential backoff. That is also just good practice when scraping websites, so it would be great if it could be included.

Skylion007 avatar Jul 20 '23 21:07 Skylion007
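
For reference, a minimal sketch of the kind of retry loop being requested, assuming plain urllib; the helper name and parameters are illustrative, not part of img2dataset:

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_retries=5, base_delay=1.0):
    """Hypothetical helper: retry on 429 with exponentially growing delays."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise  # only back off on rate limiting
            # no Retry-After header to go on, so wait 1s, 2s, 4s, 8s, ...
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")
```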

I'm afraid that will not work if you are scraping a single host. You will need to scrape slowly.

rom1504 avatar Jul 20 '23 21:07 rom1504

Yeah, unfortunately, it is unclear what the limit is. The docs say we should implement exponential backoff, which is probably a good policy across the board.

Skylion007 avatar Jul 20 '23 21:07 Skylion007

It does not make sense to implement exponential backoff while maximizing the number of parallel calls. Simply reduce the number of parallel calls.

rom1504 avatar Jul 20 '23 21:07 rom1504
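
For context, parallelism in img2dataset is controlled through its processes_count and thread_count options; a sketch of a deliberately slow single-host run, with illustrative file paths (check the README for the exact parameter set):

```python
from img2dataset import download

# one process, one thread: effectively sequential requests,
# which is the "scrape slowly" approach suggested above
download(
    url_list="urls.txt",      # illustrative path
    output_folder="images/",  # illustrative path
    processes_count=1,
    thread_count=1,
)
```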

@rom1504 Depending on the host's settings, though, you could hit the rate limit even with non-parallel calls in an extreme case.

Skylion007 avatar Jul 20 '23 21:07 Skylion007

Yes, but in that case maybe just do a loop with a wget and a sleep; it'll be simpler and not slower.

rom1504 avatar Jul 20 '23 21:07 rom1504
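
A Python equivalent of that wget-and-sleep loop, for anyone who wants to stay in one script; the file names are illustrative:

```python
import time
import urllib.request

# sequential download with a fixed pause between requests, the script
# equivalent of `while read url; do wget "$url"; sleep 1; done < urls.txt`
with open("urls.txt") as f:  # illustrative file name
    for i, url in enumerate(line.strip() for line in f):
        urllib.request.urlretrieve(url, f"img_{i:06d}.jpg")
        time.sleep(1.0)  # stay under the host's rate limit
```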

> If you send too many requests to their URLs, you receive a 429 error that does not include a Retry-After header. [...]

It probably means they don't want you to crawl it~

ldfandian avatar Jul 29 '23 11:07 ldfandian

Not necessarily; they just don't want you crawling it that fast. Some websites even send a Retry-After header to tell you to slow down and try again after x seconds.

Skylion007 avatar Jul 30 '23 21:07 Skylion007
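
When a host does send that header, honoring it is straightforward; a sketch with an illustrative helper name, handling the common integer-seconds form of Retry-After:

```python
import time
import urllib.error
import urllib.request

def fetch_honoring_retry_after(url, max_retries=5, fallback_delay=5.0):
    """Hypothetical helper: on 429, sleep for the server-specified duration."""
    for _ in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise
            # Retry-After is either seconds or an HTTP date; handle the
            # integer-seconds form and fall back to a fixed delay otherwise
            retry_after = e.headers.get("Retry-After")
            if retry_after and retry_after.isdigit():
                time.sleep(float(retry_after))
            else:
                time.sleep(fallback_delay)
    raise RuntimeError(f"still rate limited after {max_retries} retries: {url}")
```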