Events object hash
Make sure this scheme is reasonable and fast.
Issues:
- Don't want to re-read entire files if they have already been read in.
- Filename and/or file path is not necessarily a reliable indicator that two files are identical.
- Checksumming (hashing the entire contents of a file) might be slow for large files.
Proposal:
- Hash on the actual contents of the file, i.e., the events themselves, in the form of Python objects (see the first sketch after this list)
- Store this hash in the HDF5 file in the metadata (e.g., as "source_hash")
- Since cuts can be applied after the time-consuming hash has been computed over all contained events, also include the cuts (and possibly the entire sorted metadata dict) in the overall hash
- When checking a file for uniqueness, read only the metadata node rather than the entire file (see the second sketch below)
- If the hash is missing:
  - Generate a new hash based on contents?
  - Fall back to using a normalized filename/filepath?
  - Fall back to using other identifying characteristics, e.g., a hash on (first N bytes of file + length of file)? (See the third sketch below.)
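
A minimal sketch of the content-based hash described above. The helper name hash_events, the event layout (a dict of numpy arrays keyed by field name), and the choice of SHA-256 are illustrative assumptions, not the package's actual API:

    import hashlib

    import numpy as np

    def hash_events(events, metadata):
        """Hash event contents plus sorted metadata (e.g., applied cuts)."""
        hasher = hashlib.sha256()
        # Walk the event arrays in sorted-key order so the hash is deterministic
        for key in sorted(events):
            hasher.update(key.encode("utf-8"))
            hasher.update(np.ascontiguousarray(events[key]).tobytes())
        # Fold the sorted metadata (including cuts) into the same hash so that
        # cuts applied after the fact change the resulting digest
        hasher.update(repr(sorted(metadata.items())).encode("utf-8"))
        return hasher.hexdigest()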
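
A sketch of storing the hash in the HDF5 metadata and checking it without reading the event datasets, using h5py. The attribute name "source_hash" follows the proposal; the root-level placement and function names are assumptions:

    import h5py

    def write_source_hash(h5path, source_hash):
        # Attach the hash as a root-level attribute (metadata only)
        with h5py.File(h5path, "a") as f:
            f.attrs["source_hash"] = source_hash

    def read_source_hash(h5path):
        # Reading an attribute touches only the metadata, not the event data
        with h5py.File(h5path, "r") as f:
            return f.attrs.get("source_hash")  # None if no hash was stored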
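
If no stored hash is found, one cheap fallback mentioned above is hashing the first N bytes of the file together with its length. A sketch, where the choice of N and the function name are arbitrary:

    import hashlib
    import os

    def fallback_hash(path, n_bytes=1024 * 1024):
        # Hash the first n_bytes plus the total file size; much faster than
        # checksumming the whole file, at the cost of weaker uniqueness
        hasher = hashlib.sha256()
        with open(path, "rb") as f:
            hasher.update(f.read(n_bytes))
        hasher.update(str(os.path.getsize(path)).encode("utf-8"))
        return hasher.hexdigest()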