Memory issues with CNN method on large datasets
When working with large datasets (~60k images in my case), there is a memory issue caused by get_cosine_similarity in imagededup/handlers/search/retrieval.py.
The exact error looks like this:
multiprocessing.pool.MaybeEncodingError: Error sending result: '[array([[1. , 0.3477, 0.3237, ..., 0.2712, 0.463 , 0.4197],
[0.3477, 1. , 0.3743, ..., 0.3506, 0.3752, 0.4854],
[0.3237, 0.3743, 1. , ..., 0.4365, 0.4438, 0.5205],
...,
. Reason: 'MemoryError()'
I've found a simple workaround by replacing these lines https://github.com/idealo/imagededup/blob/81d383ec0774d62439eb34ca1fab21b23d83bacd/imagededup/handlers/search/retrieval.py#L34-L38 with
cos_sim = [cosine_similarity_chunk((X, idxs)) for i, idxs in zip(start_idxs, end_idxs)]
Processing the chunks sequentially, instead of first building a large list containing X for every chunk, seems to solve the issue.
There is a trade-off here between runtime and image corpus size: sequential processing will increase deduplication time, but it might be worth a look. @clennan @datitran I'd like your opinion on this.
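For reference, here is a minimal, self-contained sketch of the sequential chunked approach, assuming X is an (n_images, n_features) array of CNN encodings and a chunk size of 1000 rows; this cosine_similarity_chunk is a stand-in reimplementation for illustration, not the library's actual code:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def cosine_similarity_chunk(args):
    # args is (X, (start, end)): similarities of one row slice against the full matrix
    X, idxs = args
    start, end = idxs
    return cosine_similarity(X[start:end], X)

def get_cosine_similarity_sequential(X, chunk_size=1000):
    n = X.shape[0]
    start_idxs = list(range(0, n, chunk_size))
    end_idxs = start_idxs[1:] + [n]
    # Chunks are processed one at a time in the main process, so X is never
    # pickled into worker processes and no chunk result has to be sent back
    # through a multiprocessing pipe.
    cos_sim = [cosine_similarity_chunk((X, idxs)) for idxs in zip(start_idxs, end_idxs)]
    return np.vstack(cos_sim)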
cos_sim = [cosine_similarity_chunk((X, idxs)) for i, idxs in zip(start_idxs, end_idxs)]
It seems like it's not going to work. cosine_similarity_chunk expects idxs to be a tuple of the first and the last index, doesn't it? So the following replacement is correct:
cos_sim = [cosine_similarity_chunk((X, idxs)) for idxs in zip(start_idxs, end_idxs)]
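To illustrate with made-up index lists: zip yields (start, end) pairs, so unpacking each pair as "i, idxs" leaves idxs holding only the end index instead of the whole pair.

start_idxs = [0, 2]
end_idxs = [2, 4]
for i, idxs in zip(start_idxs, end_idxs):
    print(i, idxs)   # 0 2, then 2 4 -- idxs is a single int, not a pair
for idxs in zip(start_idxs, end_idxs):
    print(idxs)      # (0, 2), then (2, 4) -- idxs is the (start, end) tuple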
@EduardKononov Oops, my bad, you're right of course. Somehow I pasted the wrong code here, even though the modification I tested was correct.
Addressed in #185. Available in v0.3.1.