
Add a Hash column to the file table, so we can expand on file features

Open · karagog opened this issue 10 years ago · 2 comments

I was imagining a feature that would recognize when you're adding a file that's already in the database; in that case it would not add the new file, just link to the existing one. Because of the way files are implemented, any number of entries can safely reference the same file, and each entry can have its own unique metadata associated with it (file name, etc.).

This could be implemented easily by adding a hash column to the file table (make sure the hash function is cryptographically secure!) and an index that lets you look up file hashes in O(1) time. Resolving a hash collision is extremely expensive, so collisions should be avoided at all costs (use a large enough hash like SHA3-512, and we'll assume no collisions because the database is so small).
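A minimal sketch of what I'm imagining, assuming a SQLite-style file table (the real Gryptonite schema will differ, and the table/column names here are just placeholders): a hash column plus a unique index, and a lookup helper that returns the existing row if the content is already stored:

import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE file (
        id   INTEGER PRIMARY KEY,
        hash BLOB NOT NULL,   -- e.g. SHA3-512 digest of the plaintext
        data BLOB NOT NULL
    )
""")
# The unique index makes hash lookups effectively constant time and also
# guarantees we never store the same content twice.
conn.execute("CREATE UNIQUE INDEX idx_file_hash ON file (hash)")

def find_existing_file(plaintext: bytes):
    # Return the id of an existing file row with the same content, or None.
    digest = hashlib.sha3_512(plaintext).digest()
    row = conn.execute("SELECT id FROM file WHERE hash = ?", (digest,)).fetchone()
    return row[0] if row else None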

karagog · May 18 '15 12:05

Some notes on security: since we're using hashes, we have to concern ourselves with preimage attacks. We can't simply solve this with a salt, because we need to be able to build a lookup table from the hashes. I think this means we have to encrypt each hash (a MAC isn't necessary) so that we are the only ones who know what they are.
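A rough sketch of that idea, assuming the cryptography package and an AES key managed elsewhere: deterministic encryption (plain AES-ECB over the fixed-size digest) keeps equal hashes mapping to equal stored tokens, so the lookup index still works, but only the key holder can relate tokens back to real file hashes:

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_digest(key: bytes, digest: bytes) -> bytes:
    # Deterministic: the same digest always encrypts to the same token, so it
    # can still be indexed; without the key, precomputed file hashes can't be
    # matched against what's stored (the preimage concern above).
    # The digest length must be a multiple of the 16-byte AES block size
    # (a SHA3-512 digest is 64 bytes), and the key must be 16/24/32 bytes.
    encryptor = Cipher(algorithms.AES(key), modes.ECB()).encryptor()
    return encryptor.update(digest) + encryptor.finalize()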

karagog · May 18 '15 13:05

Notes on the hash algorithm: I've been thinking about which hash algorithm to use, because we need to balance the risk/cost of collisions against the size on disk (we don't care about the hash's cryptographic strength if we're going to encrypt the hashes anyway). Since disk space is cheap, that will be the least of our concerns.

First we must understand the cost of a hash collision. If there is a collision, we have to compare the files byte for byte (or at least enough bytes until we're satisfied they're the same). That means selecting the file out of the database and exporting the plaintext to disk, because files can be arbitrarily large and memory is finite; if the database is on a network drive this is especially wasteful. The high cost of a collision is enough of a deterrent that we should take appropriate measures to guarantee there are no collisions, which is in fact reasonable given that we're not operating at a huge scale. If we can guarantee there are no collisions, we won't even have to implement that comparison path (sketched below).
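For reference, the fallback we're trying to avoid would look something like this (just a sketch, not how the export/import path actually works): compare the two exported plaintext files chunk by chunk, so arbitrarily large files never have to fit in memory:

def files_identical(path_a: str, path_b: str, chunk_size: int = 1 << 20) -> bool:
    # Read both files in 1 MiB chunks and stop at the first difference.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False   # different content (or different lengths)
            if not a:
                return True    # both files exhausted at the same time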

So how do we "guarantee" no collisions? Let's assume a uniform random distribution of hashes. Borrowing the NIST guideline for when to replace an encryption key, we'll say that once the probability of a collision reaches 1/(2^32), we're no longer safe. I found the following birthday-bound approximation online for this probability: P ≈ 1 - e^(-k(k-1)/(2N)), where k is the number of files and N is the number of possible hash values.

Python (I'll add a Matlab version later):

import math

def compute_prob_collision():
    # N is the size of the hash space (i.e. how many unique hashes there are):
    # 2^64 for a 64-bit hash.
    N = 2.0 ** 64

    # k is the number of files stored in the database.
    max_k = 10000
    increment = 10
    k = increment
    while k < max_k:
        # Print the birthday-bound probability that at least two of the
        # k files share a hash.
        print(k, 1 - math.exp(-0.5 * k * (k - 1) / N))
        k += increment

compute_prob_collision()

It turns out we can store on the order of 10,000 files with a 64-bit hash before the probability of collision reaches 1/(2^32) and this scheme breaks down. That should work for most people, but I'm thinking about the worst-case user who collects lots of files. I would like to support up to 10 million files, which is probably unreasonable, but hey, it could happen. I haven't calculated it exactly yet (because that requires Matlab for the large integer types), but I think SHA-224 should be large enough.
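For a rough check without exact big-integer math, here is the same approximation for the worst case, using math.expm1 so the result doesn't round to zero when the probability is tiny (the numbers in the comments come from this approximation, not an exact count):

import math

def prob_collision(k: float, hash_bits: int) -> float:
    # Birthday-bound approximation: P ≈ 1 - exp(-k*(k-1) / (2*N)).
    # expm1 keeps precision when the probability is extremely small.
    n = 2.0 ** hash_bits
    return -math.expm1(-0.5 * k * (k - 1) / n)

# Hypothetical worst case: 10 million files.
print(prob_collision(10_000_000, 224))  # roughly 1.9e-54, far below 1/(2^32)
print(prob_collision(10_000_000, 64))   # roughly 2.7e-6, far above 1/(2^32)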

karagog · May 18 '15 14:05