clip-retrieval
Do colab to show how end2end works
Hi, thanks for building this and sharing it. I just want to know if you were able to make any progress with this? I want to run some inference on COCO with a notebook, but I'm having a hard time making it work.
I tried downloading the COCO dataset on Kaggle, but it seems Kaggle killed the process at some point and no tar file was extracted. The quota limit is around 20GB; the process was cancelled when the disk was about 15GB full (after ~2h).
Last logged message was: total - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 96 - count: 481753
Edit: After this I tried: !clip-retrieval inference --input_dataset "./mscoco" --output_folder "./embeddings"
but I got an error at line 69 of clip_inference: keys is None. Perhaps because the downloading/preprocessing didn't finish?
Do you think there might be a way to cap the number of examples to download? Perhaps to 200K or something like that?
Right now I'm trying with Colab, but the runtime keeps disconnecting and reconnecting; I'm not sure if that affects the downloading process. I'll comment here after it finishes.
Edit: Right now I'm in:
!img2dataset \
--url_list /content/sample_data/mscoco.parquet \
--input_format "parquet" \
--url_col "URL" \
--caption_col "TEXT" \
--output_format webdataset \
--output_folder /content/sample_data/mscoco \
--processes_count 16 \
--thread_count 64 \
--image_size 256 \
--enable_wandb False
Thanks!
Indeed, these environments are quite unstable. Maybe you could try locally? Inference will be slower, but it might be OK on a small dataset. You can also take only a small subset of the COCO input file to start with.
I can try locally (CPU only). BTW, how can I take a small subset of coco?
import pandas as pd

df = pd.read_parquet("mscoco.parquet")
df = df[:10000]  # keep only the first 10K rows
df.to_parquet("small.parquet")
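Then point the --url_list option of img2dataset at small.parquet instead of the full parquet file.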
Awesome, thanks, and sorry for the basic question. I'm new to the vision domain.
Glad that this repo is helping! I'm trying to make everything work both at very large scale and at low scale. It seems there's still some work to do on the low-scale part. I'll see what I can provide in terms of an even smaller dataset so it works best as a quick start on Colab.
Thanks Romain. Low scale will indeed be useful for research and learning.
Hey @rom1504, Colab took about 2h, but it seems it downloaded all the images. This is the latest logged output:
60it [1:09:18, 69.30s/it]
worker - success: 0.000 - failed to download: 0.000 - failed to resize: 1.000 - images per sec: 13 - count: 10000
total - success: 0.000 - failed to download: 0.000 - failed to resize: 1.000 - images per sec: 144 - count: 591753
Nevertheless, when running
!clip-retrieval inference --input_dataset "/content/sample_data/mscoco" --output_folder "/content/sample_data/embeddings"
I get:
/usr/local/lib/python3.7/dist-packages/clip/clip.py:23: UserWarning: PyTorch version 1.7.1 or higher is recommended
warnings.warn("PyTorch version 1.7.1 or higher is recommended")
Traceback (most recent call last):
File "/usr/local/bin/clip-retrieval", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/cli.py", line 21, in main
"front": clip_front,
File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 471, in _Fire
target=component.__name__)
File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference.py", line 323, in clip_inference
dataset = get_image_dataset()(preprocess, input_dataset, enable_text=enable_text, enable_image=enable_image)
File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference.py", line 69, in __init__
self.keys = list(keys)
TypeError: 'NoneType' object is not iterable
I looked at the latest stats JSON file (/content/sample_data/mscoco/00059_stats.json) and it contains this:
{
"count": 1753,
"successes": 0,
"failed_to_download": 0,
"failed_to_resize": 1753,
"duration": 149.17603397369385,
"start_time": 1644969973.1418736,
"end_time": 1644970122.3179076,
"status_dict": {
"module 'albumentations' has no attribute 'longest_max_size'": 1753
}
}
I updated the library's version from 0.1.12 to the latest one (1.1.0), but the problem is still happening. I imagine this is because that library is used during the preprocessing, right? If that is the case, I'll update it prior to downloading and preprocessing tomorrow and see if it works.
I'll keep you posted.
Update: Even after updating albumentations to its latest version (which also required updating opencv-python), I get the same error (TypeError: 'NoneType' object is not iterable at clip_inference.py, line 69) when trying to run !clip-retrieval inference --input_dataset "/content/sample_data/mscoco" --output_folder "/content/sample_data/embeddings".
You can check the notebook here.
I'd appreciate your support.
Thanks!
Did you rerun img2dataset, and did it say 100% success?
Previously I ran:
!img2dataset \
--url_list /content/sample_data/mscoco.parquet \
--input_format "parquet" \
--url_col "URL" \
--caption_col "TEXT" \
--output_format webdataset \
--output_folder /content/sample_data/mscoco \
--processes_count 16 \
--thread_count 64 \
--image_size 256 \
--enable_wandb False
The last logged output was:
60it [1:49:22, 109.37s/it]
worker - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 85 - count: 551753
worker - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 86 - count: 561753
worker - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 88 - count: 571753
worker - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 89 - count: 581753
worker - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 8 - count: 10000
total - success: 1.000 - failed to download: 0.000 - failed to resize: 0.000 - images per sec: 91 - count: 591753
There was no 100% success message, but it seems it finished. Is there a way to check that the img2dataset process ended correctly?
The content of the latest stats JSON file (/content/sample_data/mscoco/00059_stats.json) is:
{
"count": 1753,
"successes": 1753,
"failed_to_download": 0,
"failed_to_resize": 0,
"duration": 234.95621633529663,
"start_time": 1645025798.0262048,
"end_time": 1645026032.9824212,
"status_dict": {
"success": 1753
}
}
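In case it's useful to others, here is the quick sanity check I put together (just a sketch: it sums the per-shard _stats.json files that img2dataset writes, using the field names shown above):

import glob
import json

# Aggregate the per-shard stats files written by img2dataset (one per tar).
totals = {"count": 0, "successes": 0, "failed_to_download": 0, "failed_to_resize": 0}
for path in sorted(glob.glob("/content/sample_data/mscoco/*_stats.json")):
    with open(path) as f:
        stats = json.load(f)
    for key in totals:
        totals[key] += stats.get(key, 0)
print(totals)  # the download finished cleanly if successes == count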
Ok, what is the full error printed by clip inference?
Sorry, I forgot to mention:
/usr/local/lib/python3.7/dist-packages/clip/clip.py:23: UserWarning: PyTorch version 1.7.1 or higher is recommended
warnings.warn("PyTorch version 1.7.1 or higher is recommended")
Traceback (most recent call last):
File "/usr/local/bin/clip-retrieval", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/cli.py", line 21, in main
"front": clip_front,
File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 471, in _Fire
target=component.__name__)
File "/usr/local/lib/python3.7/dist-packages/fire/core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference.py", line 323, in clip_inference
dataset = get_image_dataset()(preprocess, input_dataset, enable_text=enable_text, enable_image=enable_image)
File "/usr/local/lib/python3.7/dist-packages/clip_retrieval/clip_inference.py", line 69, in __init__
self.keys = list(keys)
TypeError: 'NoneType' object is not iterable
OK, I see. The problem is that you saved with the webdataset format but you're then reading with the files format. Simply give webdataset to the input_format option of clip inference.
I will improve the error messages.
Oh OK, sorry for that; I was following the related documentation in the img2dataset repo.
I added the input_format flag like this:
!clip-retrieval inference --input_dataset "/content/sample_data/mscoco" --output_folder "/content/sample_data/embeddings" --input_format webdataset
But I'm getting this message, and nothing happens:
/usr/local/lib/python3.7/dist-packages/clip/clip.py:23: UserWarning: PyTorch version 1.7.1 or higher is recommended
warnings.warn("PyTorch version 1.7.1 or higher is recommended")
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
0it [00:00, ?it/s]/usr/local/lib/python3.7/dist-packages/webdataset/handlers.py:34: UserWarning: IsADirectoryError(21, 'Is a directory', 'mscoco')
warnings.warn(repr(exn))
0it [00:01, ?it/s]
I see the folders were created under embeddings, but all of them are empty.
I added the flag --num_prepro_workers 2, but still nothing happened.
You need to give the file names as input, like /content/sample_data/mscoco/{00000..00001}.tar, with webdataset.
I see, OK. !clip-retrieval inference --input_dataset "/content/sample_data/mscoco/00000.tar" --output_folder "/content/sample_data/embeddings" --input_format webdataset --num_prepro_workers 2 seems to be working now.
So this means I need to manually run the inference command for each dataset part, right? (60 in this case.)
Would choosing --output_format files make the process automatic?
Sorry if I'm asking questions that are already answered in the documentation; this happens when working in a hurry :)
So this means I need to manually run the inference command for each dataset part, right? (60 in this case.)
No, you can put /content/sample_data/mscoco/{00000..00059}.tar to do the inference on all of them.
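If you want to double-check which shard files such a pattern covers, you can expand it yourself. A minimal sketch using the braceexpand package (the same library webdataset relies on for these patterns):

from braceexpand import braceexpand

# Expand the bash-style brace pattern into individual shard paths.
shards = list(braceexpand("/content/sample_data/mscoco/{00000..00059}.tar"))
print(len(shards))  # 60
print(shards[0])    # /content/sample_data/mscoco/00000.tar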
OK! Thanks a lot for your support and help, Romain. I successfully finished the notebook. I processed just the first 10K examples (/content/sample_data/mscoco/00000) from the COCO dataset, then ran some inference.
I guess Colab is not the right tool to generate the entire index for the 600K examples, but it's enough for this kind of test.
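In case it helps anyone following along, loading the resulting embeddings is straightforward. A small sketch, assuming the img_emb/metadata output layout that clip-retrieval inference produced in my run:

import numpy as np
import pandas as pd

# Load the image embeddings and their metadata written by clip-retrieval inference.
img_emb = np.load("/content/sample_data/embeddings/img_emb/img_emb_0.npy")
meta = pd.read_parquet("/content/sample_data/embeddings/metadata/metadata_0.parquet")
print(img_emb.shape)  # one 512-d vector per image with the default ViT-B/32 model
print(meta.columns)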
As a suggestion, would it be possible to generate a txt file for each prediction in the output folder when using the filter command? Those text files could contain the score of the prediction and also textual attributes. Also, it would be nice to sort the results by score.
Hey, I'm trying to train a mini dalle2 from the lucidrains repo with the mscoco dataset. I used img2dataset and this command to get the data:
img2dataset --url_list mscoco.parquet --input_format "parquet"\
--url_col "URL" --caption_col "TEXT" --output_format webdataset\
--output_folder mscoco --processes_count 16 --thread_count 64 --image_size 256\
--enable_wandb True
Then I tried this repo to get the embeddings from the data by using this:
clip-retrieval inference --input_dataset image_folder --output_folder embeddings_folder
which was giving me the same error as @ig-perez:
TypeError: 'NoneType' object is not iterable
So I followed the suggestions in this thread, but now I'm getting this error:
>clip-retrieval inference --input_dataset "mscoco/00000.tar" --output_folder embeddings_folder --input_format webdataset --num_prepro_workers 2
The number of samples has been estimated to be 10000
Traceback (most recent call last):
File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\anejad\AppData\Roaming\Python\Python39\Scripts\clip-retrieval.exe\__main__.py", line 7, in <module>
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\cli.py", line 16, in main
fire.Fire(
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\fire\core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\fire\core.py", line 466, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\fire\core.py", line 681, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\clip_inference\main.py", line 144, in main
distributor()
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\clip_inference\distributor.py", line 13, in __call__
self.runner(i)
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\clip_inference\runner.py", line 36, in __call__
batch = iterator.__next__()
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\clip_retrieval\clip_inference\reader.py", line 242, in __iter__
for batch in self.dataloader:
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 368, in __iter__
return self._get_iterator()
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 314, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "C:\Users\anejad\AppData\Roaming\Python\Python39\site-packages\torch\utils\data\dataloader.py", line 927, in __init__
w.start()
File "C:\Program Files\Python39\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Program Files\Python39\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Program Files\Python39\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Program Files\Python39\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\Program Files\Python39\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'create_webdataset.<locals>.filter_dataset'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Program Files\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Program Files\Python39\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
Do you have any suggestions on how I can fix this? @rom1504 Thank you
You need to specify the webdataset type and give the .tar files to the command; check the readme for details.
In the latest command, I am giving it the webdataset type and the .tar files:
clip-retrieval inference --input_dataset "mscoco/{00000..00059}.tar" --output_folder embeddings_folder --input_format webdataset --num_prepro_workers 2
and I get the error above
I could fix it by setting --num_prepro_workers to 0.
I think there is an issue with the threads and multiprocessing, if you want to look into it.
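For what it's worth, the traceback looks like the usual Windows multiprocessing pitfall: worker processes are started with spawn, which pickles their arguments, and a local closure such as create_webdataset.<locals>.filter_dataset cannot be pickled. A minimal sketch reproducing the same class of error:

import multiprocessing as mp

def make_worker():
    # A function defined inside another function cannot be pickled,
    # which the spawn start method requires.
    def worker():
        pass
    return worker

if __name__ == "__main__":
    mp.set_start_method("spawn")  # the only start method available on Windows
    p = mp.Process(target=make_worker())
    p.start()  # AttributeError: Can't pickle local object 'make_worker.<locals>.worker'
    p.join()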
Are you using a virtual env?
No, could it be that?
Hi, I met a problem when using clip-retrieval. My dataset path is like this: /data1/train-{00000..00099}.tar. Each tar file contains matching .jpg and .cls files. I want to use clip-retrieval to get image embeddings. I ran it like this:
clip-retrieval inference --input_dataset /root/data0601/train-0001.tar --output_folder /root/npy0602 --input_format webdataset
I didn't encounter any issue, but there are no image embeddings nor text embeddings in the output folder. Could I ask how I can fix it?