Memory issues with CNN method on large datasets
When working with large datasets (~60k images in my case), there is a memory issue caused by get_cosine_similarity in imagededup/handlers/search/retrieval.py.
The exact error looks like this:
multiprocessing.pool.MaybeEncodingError: Error sending result: '[array([[1. , 0.3477, 0.3237, ..., 0.2712, 0.463 , 0.4197],
[0.3477, 1. , 0.3743, ..., 0.3506, 0.3752, 0.4854],
[0.3237, 0.3743, 1. , ..., 0.4365, 0.4438, 0.5205],
...,
. Reason: 'MemoryError()'
I've found a simple workaround by replacing these lines https://github.com/idealo/imagededup/blob/81d383ec0774d62439eb34ca1fab21b23d83bacd/imagededup/handlers/search/retrieval.py#L34-L38 with
cos_sim = [cosine_similarity_chunk((X, idxs)) for i, idxs in zip(start_idxs, end_idxs)]
Processing the chunks sequentially, instead of first building a large list containing X for every chunk, seems to solve the issue.
There is a trade-off here between runtime and image corpus size: sequential processing will increase deduplication time, but it might be worth a look. @clennan @datitran I'd like your opinion on this.
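For reference, here is a minimal, self-contained sketch of the sequential chunked approach, assuming X is an (n_images, n_features) array of CNN encodings and a chunk size of 1000 rows; this cosine_similarity_chunk is a stand-in reimplementation for illustration, not the library's actual code:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_chunk(args):
    # args is (X, (start, end)): similarities of one row slice against the full matrix
    X, idxs = args
    start, end = idxs
    return cosine_similarity(X[start:end], X)

def get_cosine_similarity_sequential(X, chunk_size=1000):
    n = X.shape[0]
    start_idxs = list(range(0, n, chunk_size))
    end_idxs = start_idxs[1:] + [n]
    # Chunks are processed one at a time in the main process, so X is never
    # pickled into worker processes and no chunk result has to be sent back
    # through a multiprocessing pipe.
    cos_sim = [cosine_similarity_chunk((X, idxs)) for idxs in zip(start_idxs, end_idxs)]
    return np.vstack(cos_sim)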
cos_sim = [cosine_similarity_chunk((X, idxs)) for i, idxs in zip(start_idxs, end_idxs)]
It seems like it's not going to work. cosine_similarity_chunk expects idxs to be a tuple of the first and the last index, doesn't it? So the following replacement is correct:
cos_sim = [cosine_similarity_chunk((X, idxs)) for idxs in zip(start_idxs, end_idxs)]
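To illustrate with made-up index lists: zip yields (start, end) pairs, so unpacking each pair as "i, idxs" leaves idxs holding only the end index instead of the whole pair.

start_idxs = [0, 2]
end_idxs = [2, 4]
for i, idxs in zip(start_idxs, end_idxs):
    print(i, idxs)   # 0 2, then 2 4 -- idxs is a single int, not a pair
for idxs in zip(start_idxs, end_idxs):
    print(idxs)      # (0, 2), then (2, 4) -- idxs is the (start, end) tuple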
@EduardKononov Oops, my bad, you're right of course. Somehow I pasted the wrong code here, even though the modification I tested was correct.
Addressed in #185. Available in v0.3.1.