img2dataset

Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20h on one machine.

Results 164 img2dataset issues

using laion2b-en-aesthetics65.parquet entry #3 "San Pedro: One Of Mother Nature's Most Powerful Psychedelics | Ayahuasca アヤワスカ | Scoop.it" Error: ``` Traceback (most recent call last): File "%USERPROFILE%\miniconda3\envs\controlnet\Lib\site-packages\img2dataset\downloader.py", line 328, in...

Hi! After downloading the files from laion2b-en with these parameters:

```
download(
    processes_count=32,
    url_list=parquet_file,
    resize_mode='no',
    output_folder=output_dir,
    output_format='webdataset',  # Download files as a files
    input_format='parquet',
    url_col="URL",
    caption_col="TEXT",
    number_sample_per_shard=50000,
    distributor='multiprocessing',
)
```

...

Initial implementation for #275

Currently, I'm trying to use this piece of code to download images ```python url_list = "oss://mybucket/part-xxxxxxx.parquet" df = spark.read.parquet(url_list) print("count: " + str(df.count())) download( processes_count=2, # this is not used...

https://github.com/rom1504/img2dataset/blob/21c297b089cf7bc30480825af83bf1859862f70d/img2dataset/logger.py#L235-L239 Sometimes there are macOS-generated system files whose names start with "._" and end with ".json". `int()` will throw an exception on these names and cause the logger to exit.
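A minimal sketch of one possible guard: skip macOS AppleDouble files (the `._` prefix) and any name whose leading part is not an integer, instead of letting `int()` raise. The function name `shard_ids_from_stats_files` and the `<shard_id>_stats.json` naming pattern are assumptions for illustration and are not taken from the linked logger code.

```python
import posixpath


def shard_ids_from_stats_files(paths):
    """Return integer shard ids parsed from stats file paths.

    Hypothetical helper: assumes files are named '<shard_id>_stats.json'.
    """
    shard_ids = []
    for path in paths:
        name = posixpath.basename(path)
        # macOS writes AppleDouble sidecar files such as '._00000_stats.json'.
        if name.startswith("._") or not name.endswith(".json"):
            continue
        prefix = name.split("_")[0]
        try:
            shard_ids.append(int(prefix))
        except ValueError:
            # Unrelated JSON file: ignore it rather than crash the logger.
            continue
    return shard_ids
```

Called on a folder listing that mixes real stats files with `._` sidecars, this returns only the ids of the real files.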

fsspec can open files with various compression modes for you and can infer the compression from the extension + has a nice compression plugin system. We really should be using...

I've been downloading select URLs from LAION-400M, -5B, and SBU and have noticed that there is a significant spike in RAM usage on startup that causes instances with

I am currently using this tool on a website that uses AWS CloudFront to host all their images. However, if you send too many requests to their URLs, you will...

I found it kind of silly that it will try 404 links. It's a waste of resources; the URL is not going to magically reappear. Really only 500 errors should...
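The excerpt above is truncated, so the following is only a sketch under the assumption that the suggestion is to retry on 5xx server errors and give up immediately on 4xx errors such as 404. It uses only the Python standard library; `is_retryable` and `download_with_retries` are hypothetical names and are not part of img2dataset.

```python
import urllib.error
import urllib.request


def is_retryable(status_code):
    """Assumed policy: only 5xx server errors are worth retrying."""
    return 500 <= status_code <= 599


def download_with_retries(url, retries=2, opener=urllib.request.urlopen, timeout=10):
    """Return (data, error, attempts); data is None when the download fails."""
    error = None
    attempt = 0
    for attempt in range(1, retries + 2):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read(), None, attempt
        except urllib.error.HTTPError as err:
            error = f"HTTP {err.code}"
            if not is_retryable(err.code):
                # A 404 will not come back, so stop after the first attempt.
                break
    return None, error, attempt
```

The `opener` parameter exists only so the policy can be exercised without network access, for example with a stub that always raises `urllib.error.HTTPError`.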

I've started a Kubernetes cluster and tried to start the distributed img2dataset download in cluster mode. The log in the driver pod shows the following: > Starting the downloading of...