img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
using laion2b-en-aesthetics65.parquet entry #3 "San Pedro: One Of Mother Nature's Most Powerful Psychedelics | Ayahuasca アヤワスカ | Scoop.it" Error: ``` Traceback (most recent call last): File "%USERPROFILE%\miniconda3\envs\controlnet\Lib\site-packages\img2dataset\downloader.py", line 328, in...
Hi! After downloading the files from laion2b-en with these parameters:
```
download(
    processes_count=32,
    url_list=parquet_file,
    resize_mode='no',
    output_folder=output_dir,
    output_format='webdataset',  # Download files as a files
    input_format='parquet',
    url_col="URL",
    caption_col="TEXT",
    number_sample_per_shard=50000,
    distributor='multiprocessing',
)
```
...
Initial implementation for #275
Currently, I'm trying to use this piece of code to download images ```python url_list = "oss://mybucket/part-xxxxxxx.parquet" df = spark.read.parquet(url_list) print("count: " + str(df.count())) download( processes_count=2, # this is not used...
https://github.com/rom1504/img2dataset/blob/21c297b089cf7bc30480825af83bf1859862f70d/img2dataset/logger.py#L235-L239 Sometimes there are macOS-generated system files whose names start with "._" and end with ".json". `int()` will throw an exception on them and cause the logger to exit.
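One possible way to guard against this is to skip file names that cannot be parsed before calling `int()`. The sketch below is not the project's code: `parse_shard_id` and `valid_shard_ids` are hypothetical helpers, and the `00000_stats.json` naming pattern is an assumption made for illustration.

```python
import posixpath
from typing import Iterable, List, Optional


def parse_shard_id(path: str) -> Optional[int]:
    """Return the numeric shard id from a stats file path, or None if it does not apply.

    Assumes (for illustration only) names of the form "<digits>_stats.json".
    """
    name = posixpath.basename(path)
    # Skip macOS AppleDouble files ("._<name>") and anything that is not JSON.
    if name.startswith("._") or not name.endswith(".json"):
        return None
    prefix = name.split("_", 1)[0]
    try:
        return int(prefix)
    except ValueError:
        # The prefix is not a number, so this is not a stats file we understand.
        return None


def valid_shard_ids(paths: Iterable[str]) -> List[int]:
    """Collect the parseable shard ids, sorted, ignoring every other file."""
    ids = (parse_shard_id(p) for p in paths)
    return sorted(i for i in ids if i is not None)
```

With a filter like this, a stray system file is ignored instead of raising inside the logger loop.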
fsspec can open files with various compression modes for you and can infer the compression from the extension + has a nice compression plugin system. We really should be using...
I've been downloading select URLs from LAION-400M, -5B, and SBU and have noticed that there is a significant spike in RAM usage on startup that causes instances with
I am currently using this tool on a website that uses AWS CloudFront to host all their images. However, if you do too many queries to their urls, you will...
I found it kind of silly that it will try 404 links. It's a waste of resources, since the URL is not going to magically reappear. Really only 500 errors should...
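The policy suggested here could be expressed as a small predicate. This is only a sketch of the idea, not img2dataset's retry logic: `should_retry`, its parameters, and the default of two retries are all hypothetical.

```python
def should_retry(status_code: int, attempt: int, max_retries: int = 2) -> bool:
    """Decide whether a failed download is worth another attempt.

    `attempt` is the number of retries already made (0 after the first failure).
    Only server-side errors (5xx) are treated as possibly transient; client
    errors such as 404 are treated as permanent and never retried.
    """
    if attempt >= max_retries:
        return False
    return 500 <= status_code <= 599
```

For example, `should_retry(404, 0)` is false, while `should_retry(503, 0)` is true until the retry budget is used up.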
I've started a Kubernetes cluster and tried to start the distributed img2dataset download in cluster mode. The log in the driver pod shows the following: > Starting the downloading of...