
Events object hash

Open jllanfranchi opened this issue 8 years ago • 0 comments

Make sure the events object hash is reasonable and fast.

Issues:

  • Don't want to re-read entire files if they have already been read in (see the caching sketch after this list).
  • Filename and/or file path is not a reliable indicator that two files are identical.
  • Checksumming (hashing the entire contents of a file) might be slow for large files.
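
The first point implies a cache keyed by whatever identifier is chosen, so a file is fully read at most once per session. A minimal sketch of that idea; the cache layout, the partial-content key (one of the fallbacks proposed below), and all names are illustrative assumptions, not existing PISA code:

```python
import hashlib
import os

_events_cache = {}

def _cheap_key(path):
    # Stand-in key: hash of the first 64 kB plus the file length.
    # A content hash stored in the file's metadata (see the proposal
    # below) would replace this.
    with open(path, 'rb') as f:
        head = f.read(65536)
    size = str(os.path.getsize(path)).encode()
    return hashlib.sha256(head + size).hexdigest()

def load_events(path, reader):
    # Perform the expensive full read only for keys not seen before.
    key = _cheap_key(path)
    if key not in _events_cache:
        _events_cache[key] = reader(path)
    return _events_cache[key]
```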

Proposal:

  • Hash on the actual contents of the file (i.e., the events themselves, in the form of Python objects); see the first sketch after this list.
  • Store this hash in the HDF5 file's metadata (e.g., as a "source_hash" attribute); see the second sketch after this list.
  • Since cuts can be applied after the time-consuming hash has been computed over all of the contained events, also include the cuts (and possibly the entire sorted metadata dict) in the overall hash.
  • Read only the metadata node when checking a file for uniqueness (no need to read the entire file).
  • If the hash is missing:
    • Generate a new hash based on contents?
    • Fall back to using a normalized filename/filepath?
    • Fall back to using other identifying characteristics? (E.g., hash on the first N bytes of the file plus the file's length?)
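
A minimal sketch of the content-plus-cuts hash described above, assuming events are held as a dict mapping field names to numpy arrays and that the cuts and other metadata fit in a plain dict; the container layout and the name `hash_events` are assumptions, not existing PISA code:

```python
import hashlib

import numpy as np

def hash_events(events, metadata=None):
    """Deterministic digest over event contents plus metadata."""
    h = hashlib.sha256()
    for name in sorted(events):            # fixed field order
        h.update(name.encode())
        h.update(np.ascontiguousarray(events[name]).tobytes())
    for key in sorted(metadata or {}):     # fold cuts etc. into the hash
        h.update(key.encode())
        h.update(repr(metadata[key]).encode())
    return h.hexdigest()

events = {'true_energy': np.array([1.0, 2.5]),
          'reco_energy': np.array([0.9, 2.7])}
source_hash = hash_events(events, metadata={'cuts': ['analysis']})
```

Sorting the field and metadata keys keeps the digest independent of dict ordering, so logically identical events objects hash the same.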

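The HDF5 side might then look like the following, storing the digest as a root-level attribute named "source_hash" and reading back only that attribute when checking uniqueness; the exact attribute location is an assumption, since the issue only says "in the metadata":

```python
import h5py

def write_source_hash(h5path, source_hash):
    # Attach the digest to the file's root node as an attribute.
    with h5py.File(h5path, 'a') as f:
        f.attrs['source_hash'] = source_hash

def read_source_hash(h5path):
    # Reads only the attribute, never the event datasets; returns
    # None when the hash was never written, at which point one of
    # the fallbacks above would apply.
    with h5py.File(h5path, 'r') as f:
        return f.attrs.get('source_hash')
```
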
jllanfranchi · Jun 29 '16 15:06