Error in hash function
TL;DR up front: the part of the hash function that is meant to read the middle of the file uses `fileSize / 2` to seek there. For odd `fileSize` values that expression yields a fraction, which is not a valid position for `fs.read()`; older versions of Node silently fall back to position 0, so the beginning of the file is read twice. Up-to-date versions of Node instead throw an error when passed a fractional position.
https://github.com/whyboris/Video-Hub-App/blob/45f91c7f37c1ddf984e7ff329ae58194fabdb70c/node/main-support.ts#L400
I have been trying to reproduce the hash function in Python so that I can insert files into the JSON DB manually and transfer tags to compressed versions of old files, and after hours of attempts I was at my wit's end. It seemed, though I wasn't sure, that my function returned the correct hash for files of even length, but not for odd ones. Obviously my interpretation of the above line was at fault, but I just couldn't figure out how. Long story short, after having exhausted all off-by-one variations, I consulted the wise men of the Freenode #javascript channel, where someone eventually realized that, depending on the version of Node used (!), said function may throw an error (apparently in the latest version) or it may just set the position value to -1 (?!). This is (as best I can tell) not mentioned in the docs, nor does it make any sense to me: if the function expects an int, how is -1 going to help? Anyway, as it turns out, what it actually does is set the position to 0, the beginning of the file, and modifying my Python code accordingly did return the expected hash values. Clearly, this is not intended behaviour: `fileSize / 2` should instead be rounded down (à la Python's `//` operator).
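For reference, a minimal sketch of what the corrected read could look like in Node; this is a simplified standalone function, not the project's exact code, and the function name and parameters are illustrative:

```ts
import * as fs from 'fs';

// Sketch only: the middle offset is rounded down so fs.read()/fs.readSync()
// always receives an integer position, which is valid in every Node version.
function readMiddleChunk(filePath: string, sampleSize: number): Buffer {
  const fileSize = fs.statSync(filePath).size;
  const middle = Math.floor(fileSize / 2); // Math.floor mirrors Python's // operator
  const buffer = Buffer.alloc(sampleSize);
  const fd = fs.openSync(filePath, 'r');
  fs.readSync(fd, buffer, 0, sampleSize, middle); // integer position, even for odd file sizes
  fs.closeSync(fd);
  return buffer;
}
```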
Also, while I'm here I may as well mention: there's a lot of needless back-and-forth conversion of values going on in that function. I'm not familiar with JS, but I would assume it's possible to convert an integer to a byte-like Buffer object directly instead of going through a string, and likewise I would be surprised if the md5 function itself couldn't take byte-like objects. To replicate the results in Python I had to follow the same steps, which, at least in Python, is highly redundant: I could have converted the file size (an int) to bytes in one step, concatenated it with the rest, and fed it straight to the md5 function.
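For what it's worth, here is a sketch of what that could look like with Node's built-in crypto module, hashing Buffers directly with no string round trip. Note this is only an illustration, and changing the conversion steps would also change the resulting hash values:

```ts
import { createHash } from 'crypto';

// Sketch only: crypto's update() accepts Buffers directly, so the sampled
// chunks and the file size can be hashed without any intermediate strings.
// This would produce DIFFERENT hashes than the current function.
function md5OfSamples(chunks: Buffer[], fileSize: number): string {
  const hash = createHash('md5');
  for (const chunk of chunks) {
    hash.update(chunk);                         // feed each sampled chunk as bytes
  }
  hash.update(Buffer.from(String(fileSize)));   // append the file size as bytes
  return hash.digest('hex');
}
```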
Finally, in case anyone would like to follow in my footsteps, here is the Python code for the hash function:
```python
import hashlib
import os

def md5_hash(path):
    sample_size = 16 * 1024
    sample_threshold = 128 * 1024
    file_size = os.path.getsize(path)
    with open(path, "rb") as f:
        if file_size < sample_threshold:
            concatenated = f.read() + bytes(str(file_size), 'utf-8')  # read whole file
        else:
            concatenated = f.read(sample_size)  # read beginning
            # Should be f.seek(file_size // 2); the "else 0" replicates the bug
            f.seek(file_size // 2 if file_size % 2 == 0 else 0)  # seek to middle
            concatenated += f.read(sample_size)  # read middle
            f.seek(-sample_size, 2)  # seek to end
            concatenated += f.read(sample_size) + bytes(str(file_size), 'utf-8')  # read end
    # hash the hex representation, mirroring the JS code's string round trip
    return hashlib.md5(concatenated.hex().encode()).hexdigest()
```
So one of the issues I imagine @whyboris will see with this is that, if what you write is true (and I have no reason to disbelieve it), there is a case to be made for leaving it as is.

While the hashing function may be flawed, if it is deterministic (equally flawed and consistent for all invocations) and has enough input data to avoid collisions, I would imagine that migrating old hashes to new hashes would be cumbersome. Keeping "hash" and "newHash" side by side in the config/save-state JSON is certainly doable and could assist in a migration. I suppose this all needs to be weighed against the probability of hash collisions.
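As a rough illustration of the side-by-side idea (the field names and entry shape here are hypothetical, not VHA's actual save-state schema):

```ts
// Hypothetical transitional entry shape; field names are illustrative only.
interface TransitionalEntry {
  hash: string;     // old (flawed) hash, still keys existing tags/screenshots
  newHash?: string; // corrected hash, filled in as files are re-hashed
}
```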
Thank you @RedAero for your contribution! I'm finally planning to address this problem with #708. I'll think about what to do with hash mismatches.
⚠️ this is still not fixed even though I merged in the PR updating the code. I still want to figure out how to gracefully transition from the code that has an error to one that doesn't, so that users' hubs don't lose 50% of their tags and screenshots.
Code in question: https://github.com/whyboris/Video-Hub-App/pull/708/files#diff-84a1283509603799dadbc0d1e879272e0cd9196aa1bd092948cbdf07ba60eca7R400
It would seem to me that, other than simply rebuilding the DB from scratch and merging based on paths (full paths, after all, are as unique as hashes, so the tags can be preserved, if not the screens/clips), the obvious solution would be to:
- iterate through the entries in the DB, create new hashes for them (temporarily call it something like newhash),
- simply rename the old screenshot/clip files to the new hashes,
- finally, delete the old hash entry, rename newhash to hash, and save to disk.
Some error handling w.r.t. filename/hash collisions would probably be a good idea, and if you want to save time you can either generate new hashes only for odd file sizes, or do the rename only for hashes that actually changed, but I think that's over-optimization for a one-time function.
As far as execution is concerned, I'd probably implement a quick check on launch of the new release that looks for a flag in the DB to see if the conversion has already run; if not, it runs the conversion routine and then sets a flag in the DB that it's been done. You can use the obvious version tag for this.
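For illustration, here is a rough sketch of what such a one-time conversion routine could look like. The entry shape, the hashFile() and resolveVideoPath() helpers, and the screenshot/clip file layout are all assumptions for the sketch, not Video-Hub-App's actual API:

```ts
import * as fs from 'fs';
import * as path from 'path';

// Hypothetical DB entry; the real save state has tags, times, etc.
interface DbEntry { hash: string; }

async function migrateHashes(
  entries: DbEntry[],
  screenshotFolder: string,
  hashFile: (videoPath: string) => Promise<string>, // the corrected hash function
  resolveVideoPath: (entry: DbEntry) => string      // full path to the source video
): Promise<void> {
  for (const entry of entries) {
    const newHash = await hashFile(resolveVideoPath(entry));
    if (newHash === entry.hash) {
      continue; // even-sized files hash the same; nothing to do
    }
    for (const ext of ['.jpg', '.mp4']) { // whatever extensions the screens/clips use
      const oldFile = path.join(screenshotFolder, entry.hash + ext);
      if (fs.existsSync(oldFile)) {
        fs.renameSync(oldFile, path.join(screenshotFolder, newHash + ext));
      }
    }
    entry.hash = newHash; // replace the old hash in place
  }
  // the caller then saves the DB and sets the "conversion done" flag / version tag
}
```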
By the way, has Node.js been updated? Does the old code now throw an error?
Thank you for the detailed suggestion of how to address this problem!
I did update Node and some other dependencies along the way, and the bug finally showed up as an error, so I "fixed" it in the PR I just merged in. But the "fix" introduced the discrepancy between old and new hashes, which I'll need to address before I build a new release.
This is the PR and the "fix": https://github.com/whyboris/Video-Hub-App/pull/708/files#diff-84a1283509603799dadbc0d1e879272e0cd9196aa1bd092948cbdf07ba60eca7R400
I'm keeping the issue open so that I remember to address it before the bugfix release (hopefully in a month or two).