MD5 hash calculated twice for each uploaded file when using --checksum, --verify, or --delete
Version and platform: 1.7.7 on CPython 3.4.2 on Debian Jessie
When ia upload is used with --checksum, --verify, and/or --delete, the MD5 hash of each file is calculated twice: once at the beginning of the command and again before the upload of each file. This happens because internetarchive.utils.recursive_file_count does the calculation regardless of the value of checksum. internetarchive.item.Item.upload_file later calls internetarchive.utils.get_md5 again directly if any of those options are used.
Some thoughts on how to solve this
The most straightforward solution would be to make the checksum calculation in recursive_file_count dependent on the checksum argument. However, the hash would still be calculated twice when --checksum is used.
It would also be possible to implement a cache into get_md5. This could be done using a weak reference to the file object (weakref.WeakKeyDictionary) to prevent "memory leak" issues without having to manually delete the cache key upon closing a file.
Another approach would be to use a wrapper around the file object which lazily evaluates the hash when it's first accessed and then caches it. Something like:
class File:
def __init__(self, filename, ...):
self.raw = open(filename, ...)
self._md5 = None
def __getattr__(self, attr):
return getattr(self.raw, attr)
@property
def md5(self):
if self._md5 is None:
self.raw.seek(0)
self._md5 = get_md5(self.raw)
return self._md5
__getattr__ allows the remaining code to use this object like a normal file object. (It isn't possible to directly subclass file in Python 3 because open returns any of several classes. It may be necessary to add further magic methods depending on how the file objects are used in the library.)
Let me know what you think about these solutions or if you have something else in mind. I'll happily prepare a PR.
#77 is somewhat related.
I don't believe this is resolved fully. As mentioned by @Dobatymo in the description, that PR "fixes #253 for the case queue_derive is False" (emphasis mine) only. (NB, I haven't tested this, only went through the code and the changes to the relevant parts since I filed the issue.)
Commit 5ba8acb2 from a few months ago is also a partial fix to this issue – it causes ia to only calculate the checksum in recursive_file_count when checksum is True, i.e. when one of those checksum options is used.
In short, here's a summary of the current state as far as I can see:
- Queue derive, no checksum options: fixed by 5ba8acb2
- Queue derive, checksum option in use: not fixed, still calculates it twice
- Don't queue derive, no checksum options: fixed by 5ba8acb2 and #347
- Don't queue derive, checksum option in use: fixed by #347
@jjjake Please reopen.
@JustAnotherArchivist you are right, my PR https://github.com/jjjake/internetarchive/pull/347 only adresses this special case. I think this issue is not fully resolved yet.