img2dataset
Implement Exponential Backoff
I am currently using this tool on a website that uses AWS CloudFront to host all of its images. However, if you make too many requests to its URLs, you receive a 429 error that does not provide any Retry-After header. This means that img2dataset just keeps hammering CloudFront, causing the ban to be extended. The only way to still allow images to be downloaded is to apply exponential backoff. This is also just good practice when scraping websites, so it would be great if it could be included.
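For concreteness, something along these lines is what I have in mind; a rough sketch using plain `requests` rather than img2dataset's actual downloader, with the retry count and delays picked arbitrarily:

```python
import time

import requests


def fetch_with_backoff(url, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a download with exponential backoff when the host returns 429."""
    delay = base_delay
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            response.raise_for_status()
            return response.content
        # No Retry-After header is provided, so back off exponentially.
        time.sleep(delay)
        delay = min(delay * 2, max_delay)
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```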
I'm afraid that will not work if you are scraping a single host. You will need to scrape slowly.
Yeah, unfortunately it is unclear what the limit is. The docs say we should implement exponential backoff, which is probably a good policy uniformly.
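For what it's worth, the AWS guidance generally suggests adding jitter to the backoff as well; the delay could be computed roughly like this (the constants are just illustrative):

```python
import random


def backoff_delay(attempt, base_delay=1.0, max_delay=60.0):
    """Full-jitter backoff: a random delay up to a capped exponential bound."""
    return random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
```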
It does not make sense to implement exponential backoff while maximizing the number of parallel calls. Simply reduce the number of parallel calls.
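For a single host, something like this (the Python API from the README, with `urls.txt` standing in for your url list) should be enough:

```python
from img2dataset import download

# Keep parallelism minimal so a single host is not hammered.
download(
    url_list="urls.txt",      # placeholder: one url per line
    output_folder="images",
    processes_count=1,
    thread_count=1,
)
```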
@rom1504 Depending on the settings of the host, though, you could hit the rate limit even with non-parallel calls in an extreme case.
Yes, but in that case maybe just do a loop with wget and a sleep; it'll be simpler and not slower.
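That is, the Python equivalent of a wget-and-sleep loop, roughly like this (the input file, output paths, and one-second pause are just placeholders):

```python
import time
from pathlib import Path

import requests

Path("images").mkdir(exist_ok=True)

# Hypothetical input: one url per line.
with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    data = requests.get(url, timeout=10).content
    Path(f"images/{i:06d}.jpg").write_bytes(data)
    time.sleep(1)  # fixed pause between requests
```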
It probably means they don't want you to crawl it~
Not necessarily; they just don't want you crawling it that fast. Some websites even send a Retry-After header to tell you to slow down and try again after X seconds.
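When that header is present it is worth honouring it before falling back to a fixed or exponential delay; a small sketch (again assuming a `requests` response object, with the fallback delay chosen arbitrarily):

```python
import time


def polite_delay(response, fallback=5.0):
    """Sleep for the server's Retry-After value if it sent one, else a fallback delay."""
    retry_after = response.headers.get("Retry-After")
    try:
        time.sleep(float(retry_after))
    except (TypeError, ValueError):
        # Header missing, or given as an HTTP date rather than seconds.
        time.sleep(fallback)
```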