Events object hash
Make sure this scheme is reasonable and fast.
Issues:
- Don't want to re-read entire files if they have already been read in.
- Filename and/or file path is not necessarily a reliable indicator that two files are identical.
- Checksumming (hashing the entire contents of a file) might be slow for large files.
Proposal:
- Hash on the actual contents of the file, i.e., the events themselves, in the form of Python objects (see the first sketch after this list)
- Store this hash in the HDF5 file in the metadata (e.g., as "source_hash")
- Since cuts can be applied after the time-consuming hash has been computed over all contained events, also include the cuts (and possibly the entire sorted metadata dict) in the overall hash
- When checking a file for uniqueness, read only the metadata node rather than the entire file (see the second sketch below)
- If the hash is missing:
  - Generate a new hash based on contents?
  - Fall back to using a normalized filename/filepath?
  - Fall back to using other identifying characteristics, e.g., a hash on (first N bytes of file + length of file)? (See the third sketch below.)
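
A minimal sketch of the content-based hash described above. The helper name hash_events, the event layout (a dict of numpy arrays keyed by field name), and the choice of SHA-256 are illustrative assumptions, not the package's actual API:

    import hashlib

    import numpy as np

    def hash_events(events, metadata):
        """Hash event contents plus sorted metadata (e.g., applied cuts)."""
        hasher = hashlib.sha256()
        # Walk the event arrays in sorted-key order so the hash is deterministic
        for key in sorted(events):
            hasher.update(key.encode("utf-8"))
            hasher.update(np.ascontiguousarray(events[key]).tobytes())
        # Fold the sorted metadata (including cuts) into the same hash so that
        # cuts applied after the fact change the resulting digest
        hasher.update(repr(sorted(metadata.items())).encode("utf-8"))
        return hasher.hexdigest()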
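
A sketch of storing the hash in the HDF5 metadata and checking it without reading the event datasets, using h5py. The attribute name "source_hash" follows the proposal; the root-level placement and function names are assumptions:

    import h5py

    def write_source_hash(h5path, source_hash):
        # Attach the hash as a root-level attribute (metadata only)
        with h5py.File(h5path, "a") as f:
            f.attrs["source_hash"] = source_hash

    def read_source_hash(h5path):
        # Reading an attribute touches only the metadata, not the event data
        with h5py.File(h5path, "r") as f:
            return f.attrs.get("source_hash")  # None if no hash was stored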
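
If no stored hash is found, one cheap fallback mentioned above is hashing the first N bytes of the file together with its length. A sketch, where the choice of N and the function name are arbitrary:

    import hashlib
    import os

    def fallback_hash(path, n_bytes=1024 * 1024):
        # Hash the first n_bytes plus the total file size; much faster than
        # checksumming the whole file, at the cost of weaker uniqueness
        hasher = hashlib.sha256()
        with open(path, "rb") as f:
            hasher.update(f.read(n_bytes))
        hasher.update(str(os.path.getsize(path)).encode("utf-8"))
        return hasher.hexdigest()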