imagededup
CNN: taking a long time or not working.
I did not get the results I was expecting with the hashing methods, so I decided to try the CNN.
I left it in the evening at INFO Large feature matrix thus calculating cosine similarities in chunks... 0%| | 0/127 [00:00<?, ?it/s]
Then I ran it overnight and woke up 10 hours later to INFO Large feature matrix thus calculating cosine similarities in chunks... 0%| | 0/127 [00:00<?, ?it/s]
So I am guessing it is not working? Or should I let it run for longer? I am running it on Google Colab Pro's GPU with 27 GB RAM.
It's 127,000 images of various sizes.
Currently, imagededup is not well suited for the scale of images you have. We have tried up to 60K images on a Colab notebook similar in configuration to yours, and it worked then. The exact issue you're facing is due to the similarity computation, which in your case requires calculating a 127k x 127k dense matrix, which bloats the memory requirements. This step has been optimized by chunking the matrix and processing different chunks on different processes, but it can still take incredibly long even if memory doesn't run out.
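As a rough back-of-the-envelope illustration (not something imagededup prints itself, and assuming float32 entries), the unchunked matrix alone would far exceed the 27 GB of RAM you have:

n = 127_000
matrix_bytes = n * n * 4                # float32 similarity entries
print(f"~{matrix_bytes / 1e9:.0f} GB")  # roughly 65 GB for the dense matrix alone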
We're currently trying out similarity computation/search algorithms to help us tackle deduplication at scale (these usually trade off deduplication accuracy for speed/memory). You can try the code below:
- Install nmslib:
pip install nmslib
from imagededup.methods import CNN
cnn = CNN()
encodings = cnn.encode_images('path/to/images') # In your case, this should be a dictionary with 127K entries (if there are no corrupt images or images with unsupported format)
# Large scale similarity search
import nmslib
import numpy as np
data = np.array(list(encodings.values()))
index = nmslib.init(method='hnsw', space='cosinesimil')
index.addDataPointBatch(data)
# Set index parameters
M = 40 # Max links per node
efConstruction = 40 # Size of the dynamic list used during construction. A larger value means a better quality index, but increases build time. Should be an integer value between 1 and the size of the dataset.
num_threads = 4
index_time_params = {'M': M, 'indexThreadQty': num_threads, 'efConstruction': efConstruction, 'post' : 0} # 'post': postprocessing
index.createIndex(index_time_params, print_progress=True)
K = data.shape[0] # Number of neighbours (setting to the size of the dataset; usual practice is to specify a value such as 100 or so)
efSearch = 50 # Size of the dynamic list used during search. Higher values lead to improved recall at the expense of longer search time. Can take values between k and the size of the dataset and may be greater or smaller than ef_construction. Typical values are 100 - 2000.
query_time_params = {'efSearch': efSearch}
print('Setting query-time parameters', query_time_params)
index.setQueryTimeParams(query_time_params)
neighbours = index.knnQueryBatch(data, k=K)
def retrieve_neighbours_one_file(neighbours_onefile, onefile_matrix_row_index, sim_thresh, all_filenames):
    # Gets duplicates for one file
    self_retrieved_file_pos = np.where(neighbours_onefile[0] == onefile_matrix_row_index)  # Avoid self-retrieval
    neighbours_onefile_files = np.delete(neighbours_onefile[0], self_retrieved_file_pos)
    neighbours_onefile_sims = np.delete(neighbours_onefile[1], self_retrieved_file_pos)
    sim_neighbours = 1 - neighbours_onefile_sims  # Convert cosine distance to similarity
    thresh_sims = sim_neighbours[np.where(sim_neighbours >= sim_thresh)]
    thresh_neighbours = neighbours_onefile_files[np.where(sim_neighbours >= sim_thresh)]
    thresh_neighbours_filenames = [all_filenames[i] for i in thresh_neighbours]
    dups = list(zip(thresh_neighbours_filenames, thresh_sims))
    return dups
filenames = list(encodings.keys())
file_matrix_inds = range(data.shape[0])
min_sim_threshold = 0.9
res = list(map(retrieve_neighbours_one_file, neighbours, file_matrix_inds, [min_sim_threshold] * data.shape[0], [filenames] * data.shape[0]))
duplicates = dict(zip(filenames, res))
This should give you an output equivalent to find_duplicates with the scores parameter set to True. (The code is kind of messy, but should hopefully be enough to get you decent results.)
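As a small follow-up sketch (not part of the original snippet), the resulting duplicates dict can be post-processed just like the output of find_duplicates with scores, for example to greedily collect a set of files to delete while keeping one image per duplicate group:

# Follow-up sketch: mark duplicates for removal, keeping one copy per group
to_remove = set()
for fname, dups in duplicates.items():
    if fname in to_remove:           # Already marked as someone else's duplicate
        continue
    for dup_name, score in dups:
        to_remove.add(dup_name)      # Keep fname, drop its duplicates
print(f'{len(to_remove)} files marked for removal')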
Caveats:
- The large-scale method used here (HNSW) trades off deduplication accuracy against speed/memory consumption. This trade-off can be controlled by tuning the hyperparameters (M, efConstruction, K, efSearch).
- I have set some sensible defaults for the hyperparameters, but you should tinker with the values if it doesn't give you the kind of results you want or takes too much time or memory (see the sketch after this list for one possible starting point). You can learn more about the hyperparameter tuning here and find the paper here.
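For example (untested values, just one possible starting point rather than a library recommendation), capping K instead of retrieving the full dataset size per query and raising efSearch usually cuts query time substantially:

# Hypothetical alternative settings, reusing the index built above
K = 100                                      # Retrieve only the 100 nearest neighbours per image
index.setQueryTimeParams({'efSearch': 200})  # Spend a bit more time per query for better recall
neighbours = index.knnQueryBatch(data, k=K)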
It would be great if you could share the results of your experiment (at least the time taken, and more if you wish), as we're still trying to understand the algorithm and its hyperparameters.
@tanujjain how about storing the encoded image results (numpy arrays) in Apache Solr and using Solr functions to implement the similarity search at scale?
There's another paper I was going through: https://cmp.felk.cvut.cz/~chum/papers/chum_bmvc08.pdf
Let me know your view on this.
In my case, I ran cnn.find_duplicates() and it took a long time with no response. I tried the code above: on a set with 25k images it finished in about 10 minutes and the results seem OK, but when I run it on a set with 177k images it runs for over an hour without a response. I will try to find out the problem.
One more thing: the duplicates output format is different from cnn.find_duplicates(). Besides the filename, it also gives the similarity score for each file.
===Update==== The 177k image dataset is finished.