
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
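For context, a minimal invocation through the project's Python API looks roughly like the sketch below; the file names and parameter values are placeholders, not recommendations.

```python
from img2dataset import download

# Download and resize every image listed in a plain-text URL file
# (paths, counts and sizes are illustrative).
download(
    url_list="myimglist.txt",
    input_format="txt",
    output_folder="output_folder",
    output_format="files",
    image_size=256,
    processes_count=16,
    thread_count=32,
)
```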

Results: 164 img2dataset issues

This is an attempt to fix #332 in a simple manner (not using anything fancy like urllib3.Retry). I think it should improve download performance significantly on datasets with large amounts...
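For reference, a plain retry loop of this kind might look like the sketch below; the function name, retry count and backoff are illustrative, not the PR's actual code.

```python
import time
import urllib.request

def download_with_retry(url, retries=3, backoff=2.0, timeout=10):
    # Hypothetical helper: retry a fetch a few times with a growing delay,
    # without relying on urllib3.Retry.
    last_err = None
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except Exception as err:  # timeouts, connection resets, HTTP errors
            last_err = err
            time.sleep(backoff * (attempt + 1))
    raise last_err
```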

Is there any way to use a proxy when downloading images? Sometimes the image's address can't be reached directly due to network restrictions, and intermediate proxies are...
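As a generic workaround at the urllib level, traffic can be routed through a proxy; the proxy address below is a placeholder and this is not a built-in img2dataset option.

```python
import urllib.request

# Route HTTP and HTTPS traffic through an intermediate proxy
# (the proxy URL is a placeholder).
proxy_handler = urllib.request.ProxyHandler({
    "http": "http://proxy.example.com:8080",
    "https": "http://proxy.example.com:8080",
})
opener = urllib.request.build_opener(proxy_handler)

with opener.open("https://example.com/image.jpg", timeout=10) as resp:
    data = resp.read()
```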

Scripts and software for automated scraping must follow robots.txt rules; otherwise the user may be liable for unauthorised use of data.
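A hypothetical pre-download check using Python's standard robotparser could look like this; the user-agent string and the fail-open behaviour on an unreachable robots.txt are assumptions, not current img2dataset behaviour.

```python
from urllib import robotparser
from urllib.parse import urlparse

def allowed_by_robots(url, user_agent="img2dataset"):
    # Fetch the host's robots.txt and ask whether this URL may be crawled.
    parts = urlparse(url)
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except Exception:
        return True  # robots.txt missing or unreachable: fail open
    return parser.can_fetch(user_agent, url)
```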

enhancement

See https://noml.info/

```diff
diff --git a/README.md b/README.md
index 12fd5e6..f9b65d1 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ For better performance, it's highly recommended to set up a fast dns...
```

When I tried to download coyo-700m with img2dataset, I got an error: "pyarrow.lib.ArrowInvalid: Could not open Parquet input source '': Parquet magic bytes not found in footer. Either the file...
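That error usually means a shard file is empty, truncated, or not actually Parquet. A quick sanity pass with pyarrow can locate the offending file; the glob pattern below is an assumption about where the metadata shards live.

```python
import glob
import pyarrow.parquet as pq

# Scan every shard and report the ones whose footer can't be read
# (the path pattern is illustrative).
for path in sorted(glob.glob("coyo-700m/*.parquet")):
    try:
        pq.read_metadata(path)
    except Exception as err:
        print(f"bad parquet file: {path}: {err}")
```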

https://github.com/rom1504/img2dataset/tree/streaming_refacto is some work I started on this about 8 months ago; I still think it's the right direction. ![Screenshot_20230820_233013](https://github.com/rom1504/img2dataset/assets/2346494/e3433301-be3e-4321-9848-8bd15f0eddd2) May try to finish it soon; it would close #82 #188 and...

The current implementation seems to fail when the URLs in a `.txt` input file have commas in them. This modification seems to fix the bug. (Disclaimer: I am not 100%...
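For illustration only, reading a `.txt` list as one URL per line and skipping CSV parsing avoids the comma issue; this is a minimal sketch, not the project's actual reader or the PR's fix.

```python
def read_url_list(path):
    # One URL per line; commas inside a URL stay intact because no
    # CSV parsing is involved. Hypothetical helper for illustration.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

urls = read_url_list("urls.txt")
# e.g. "https://example.com/render?size=256,256" stays a single URL
```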

https://www.reddit.com/r/StableDiffusion/comments/16v4ld8/25_million_creative_commons_image_dataset_released/?one_tap=true

Hi. When trying to download many images, I often noticed that the job seemed to stop making progress near the end. There could remain less than 1% of the...