internetarchive icon indicating copy to clipboard operation
internetarchive copied to clipboard

MD5 hash calculated twice for each uploaded file when using --checksum, --verify, or --delete

Open JustAnotherArchivist opened this issue 7 years ago • 3 comments

Version and platform: 1.7.7 on CPython 3.4.2 on Debian Jessie

When ia upload is used with --checksum, --verify, and/or --delete, the MD5 hash of each file is calculated twice: once at the beginning of the command and again before the upload of each file. This happens because internetarchive.utils.recursive_file_count does the calculation regardless of the value of checksum. internetarchive.item.Item.upload_file later calls internetarchive.utils.get_md5 again directly if any of those options are used.

Some thoughts on how to solve this

The most straightforward solution would be to make the checksum calculation in recursive_file_count dependent on the checksum argument. However, the hash would still be calculated twice when --checksum is used.

It would also be possible to implement a cache into get_md5. This could be done using a weak reference to the file object (weakref.WeakKeyDictionary) to prevent "memory leak" issues without having to manually delete the cache key upon closing a file.

Another approach would be to use a wrapper around the file object which lazily evaluates the hash when it's first accessed and then caches it. Something like:

class File:
    def __init__(self, filename, ...):
        self.raw = open(filename, ...)
        self._md5 = None

    def __getattr__(self, attr):
        return getattr(self.raw, attr)

    @property
    def md5(self):
        if self._md5 is None:
            self.raw.seek(0)
            self._md5 = get_md5(self.raw)
        return self._md5

__getattr__ allows the remaining code to use this object like a normal file object. (It isn't possible to directly subclass file in Python 3 because open returns any of several classes. It may be necessary to add further magic methods depending on how the file objects are used in the library.)

Let me know what you think about these solutions or if you have something else in mind. I'll happily prepare a PR.

JustAnotherArchivist avatar May 14 '18 11:05 JustAnotherArchivist

#77 is somewhat related.

JustAnotherArchivist avatar May 14 '18 12:05 JustAnotherArchivist

I don't believe this is resolved fully. As mentioned by @Dobatymo in the description, that PR "fixes #253 for the case queue_derive is False" (emphasis mine) only. (NB, I haven't tested this, only went through the code and the changes to the relevant parts since I filed the issue.)

Commit 5ba8acb2 from a few months ago is also a partial fix to this issue – it causes ia to only calculate the checksum in recursive_file_count when checksum is True, i.e. when one of those checksum options is used.

In short, here's a summary of the current state as far as I can see:

  • Queue derive, no checksum options: fixed by 5ba8acb2
  • Queue derive, checksum option in use: not fixed, still calculates it twice
  • Don't queue derive, no checksum options: fixed by 5ba8acb2 and #347
  • Don't queue derive, checksum option in use: fixed by #347

@jjjake Please reopen.

JustAnotherArchivist avatar May 23 '20 19:05 JustAnotherArchivist

@JustAnotherArchivist you are right, my PR https://github.com/jjjake/internetarchive/pull/347 only adresses this special case. I think this issue is not fully resolved yet.

Dobatymo avatar May 26 '20 03:05 Dobatymo