Duplicate-Image-Finder
KeyError during union search when there are invalid files
Thanks for creating this wonderful tool. I'm using it to deduplicate photo albums going back to my days with 35mm film. Simple file matching tools won't find photos that were accidentally scanned more than once.
I've found a probable bug; I'm reporting it here along with a workaround.
In the file dif.py, the function _build_image_dictionaries() has this code at about line 182:
file_nums = [(i, valid_files[i]) for i in range(len(valid_files))]
Just after that, there is logic that checks for invalid files and records them, and that adds valid files to the dictionaries. Invalid files are never added to the dictionaries, but the file count is still incremented.
The result is that there can be gaps in the file numbers. The build process completes without error, but the union search phase then raises a KeyError. When I first encountered this, I assumed it was a duplicate key, but it is actually a missing key.
I added some scaffolding code to dump the filename dictionary and the list of invalid files to a scratch log, which confirmed the numbering gaps.
It's not clear to me whether the correct fix is to not increment the file count for invalid files, or to insert dummy entries into the dictionaries in place of invalid files (while still keeping their filenames in that dictionary for logging?). My workaround has been to fix or delete the faulty image files; re-running the same operation then succeeds.
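For illustration, here is a minimal sketch of the failure mode as I understand it (hypothetical names, not difPy's actual code): the counter advances for every file, but only valid files are stored, so the keys end up with gaps.

```python
from typing import Dict, List, Tuple

def build_dictionaries(files: List[str]) -> Tuple[Dict[int, str], List[str], int]:
    # Hypothetical sketch of the bug: the counter is incremented for
    # every file, but only valid files are stored in the dictionary,
    # leaving gaps in the numeric keys.
    filename_dict: Dict[int, str] = {}
    invalid_files: List[str] = []
    count = 0
    for file in files:
        if file.endswith(".bad"):      # stand-in for "PIL failed to open"
            invalid_files.append(file)
        else:
            filename_dict[count] = file
        count += 1                     # incremented either way -> gap
    return filename_dict, invalid_files, count

d, bad, n = build_dictionaries(["a.jpg", "b.bad", "c.jpg"])
# keys are {0, 2}; a later loop over range(n) hits the missing key 1
```

Any later phase that iterates over `range(count)` (or over pairs of ids built from it) will look up the missing key and raise the KeyError described above.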
I had a KeyError because of a few bad files. I moved them out of the folder, and it is working now. They were all under 1 kB.
I think this is the same issue?
difPy preparing files: [100%]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "[...]/python/3.11.9/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "[...]/python/3.11.9/lib/python3.11/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
^^^^^^^^^^^^^^^^
File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 416, in _find_matches_batch
tensor_B_list = np.asarray([self.__difpy_obj._tensor_dictionary[x[1]] for x in ids])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 416, in <listcomp>
tensor_B_list = np.asarray([self.__difpy_obj._tensor_dictionary[x[1]] for x in ids])
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
KeyError: 419
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 921, in <module>
se = search(dif, similarity=args.similarity, rotate=args.rotate, lazy=args.lazy, processes=args.processes, chunksize=args.chunksize)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 252, in __init__
self.result, self.lower_quality, self.stats = self._main()
^^^^^^^^^^^^
File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 266, in _main
result = self._search_union()
^^^^^^^^^^^^^^^^^^^^
File "[...]/.venv/lib/python3.11/site-packages/difPy/dif.py", line 303, in _search_union
for output in pool.imap_unordered(self._find_matches_batch, self._yield_comparison_group(), self.__chunksize):
File "[...]/python/3.11.9/lib/python3.11/multiprocessing/pool.py", line 451, in <genexpr>
return (item for chunk in result for item in chunk)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "[...]/python/3.11.9/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
KeyError: 419
I'm running this on a very large set of photos (~1TB), so there certainly could be some "bad" files if that is the cause for this. Not sure how to go about finding which file(s) caused this. Any suggestions?
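In the meantime, a quick standalone scan might help narrow it down. This is a sketch using PIL directly (independent of difPy; the function name is mine) that flags files which fail to open or decode cleanly when warnings are promoted to errors:

```python
import warnings
from pathlib import Path
from typing import List, Tuple

from PIL import Image

def find_bad_images(root: str) -> List[Tuple[Path, str]]:
    """Report files under `root` that PIL cannot open and decode cleanly."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        with warnings.catch_warnings():
            # Treat PIL warnings (e.g. palette-transparency,
            # decompression-bomb) as failures, mirroring difPy's behavior.
            warnings.simplefilter("error")
            try:
                with Image.open(path) as img:
                    img.load()  # force a full decode, not just the header
            except Exception as e:
                bad.append((path, repr(e)))
    return bad

if __name__ == "__main__":
    for path, err in find_bad_images("photos/"):  # adjust the root folder
        print(f"{path}: {err}")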
So the issue seems to be triggered by any warning that PIL raises inside the _generate_tensor function.
If you want to at least see the files causing the issue, you can add:
warnings.simplefilter('error', UserWarning)
warnings.simplefilter('error', Image.DecompressionBombWarning)
right before the Image.open call (if you get any different warnings, add a new line with that warning type).
Then add a print message like the following in the exception handler:
print(f"Error loading image: {num} -> '{file}' -> {e}")
The function should then look like:
def _generate_tensor(self, num: int, file: str) -> dict:
    # Function that generates a tensor of an image
    try:
        warnings.simplefilter('error', UserWarning)
        warnings.simplefilter('error', Image.DecompressionBombWarning)
        img = Image.open(file)
        if img.getbands() != ('R', 'G', 'B'):
            img = img.convert('RGB')
        shape = np.asarray(img).shape  # new
        img = img.resize((self.__px_size, self.__px_size), resample=Image.BICUBIC)
        img = np.asarray(img)
        return (num, img, shape)
    except Exception as e:
        print(f"Error loading image: {num} -> '{file}' -> {e}")
        if e.__class__.__name__ == 'UnidentifiedImageError':
            return {str(Path(file)): 'UnidentifiedImageError: file could not be identified as image.'}
        else:
            return {str(Path(file)): str(e)}
The search will still throw the error, and the missing key will be one of the {num}s for which a warning was raised. It remains to figure out what is still trying to read that id, but at least you can now see which images are causing issues.
Note that this function has issues with PNGs that have transparency: they throw a "Palette images with Transparency expressed in bytes should be converted to RGBA images" warning.
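One way to sidestep that particular warning (a sketch of a common Pillow pattern, not difPy's actual fix; the helper name is mine) is to route palette-mode images through RGBA before dropping to RGB:

```python
from PIL import Image

def to_rgb(img: Image.Image) -> Image.Image:
    # Palette ("P") images carrying transparency should be converted to
    # RGBA first; converting them straight to RGB triggers the warning
    # and discards the alpha information abruptly.
    if img.mode == "P":
        img = img.convert("RGBA")
    if img.mode != "RGB":
        img = img.convert("RGB")
    return img

# usage: img = to_rgb(Image.open("palette_with_alpha.png"))
```

This replaces the `img.getbands() != ('R', 'G', 'B')` check with a mode check, so transparent palette PNGs are normalized without raising the warning.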