
Allow additional hash types in Global Object

Teque5 opened this issue 2 years ago · 5 comments

I propose we add another optional field called core:md5 containing an md5 checksum. This, along with the existing optional core:sha256, will cover 99% of use cases.

In the workshop we discussed potentially adding two fields describing the algorithm and the value, but I think it's better to add as few fields as possible. With almost no effort we can support md5sum, and it's what people need.
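As a sketch of how little effort this takes in practice, both digests can be computed in a single chunked pass over the file. The helper below is hypothetical (not part of the SigMF spec or reference implementation); only the field names core:md5 (proposed) and core:sha256 (existing) come from this thread.

```python
import hashlib

def file_checksums(path, chunk_size=65536):
    """Compute md5 and sha256 digests in one pass over the file.
    Hypothetical helper for illustration only."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    # Hex digests would populate the proposed "core:md5" and the
    # existing "core:sha256" fields in the global object.
    return md5.hexdigest(), sha256.hexdigest()
```

Reading in chunks keeps memory use constant regardless of recording size.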

The edge case described at the workshop was just to validate that the start/middle/end of the file matched for max speed.

Originally discussed GNU Radio Conference 2021.

@bhilburn @jacobagilbert

Teque5 avatar Sep 23 '21 18:09 Teque5

While md5 checksums are faster than sha256, the difference is only moderate (50-300% faster depending on the benchmark; the Linux commands are about 80% faster on 2 GB files on my system). The time to compute still scales with file length, and md5 has the same drawback that every byte of the file must be read (and passed over the network if the check is being performed against a NAS/SAN).

Overall I am not convinced that adding md5 as an option provides much over sha256, and it has the disadvantage of introducing a second way to do basically the same thing.

The fast-data-check proposed by @n-west resolves all of those issues by hashing a small number of samples from the start/middle/end of a file (provided it is at least a certain size). The drawback of this, obviously, is that it does not verify the integrity of the full file. In our experience managing and maintaining large distributed datasets, by far the most common requirement is to know whether "major changes" happened (a block of data was added to the end, a block of data is missing from the end, files were swapped, etc.).
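The idea can be sketched roughly as follows. This is an illustration of the approach attributed to @n-west above, not his actual implementation; the block size, the inclusion of the file length in the digest, and the function name are all assumptions.

```python
import hashlib
import os

def fast_data_check(path, block_size=4096):
    """Hash only the start, middle, and end of the file plus its length.
    Sketch of the fast-data-check idea discussed in this thread; block
    size and exact construction are assumptions, not spec."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    h.update(str(size).encode())  # catches growth/truncation for free
    with open(path, "rb") as f:
        if size <= 3 * block_size:
            h.update(f.read())  # small file: just hash all of it
        else:
            # Three fixed windows: start, middle, end.
            for offset in (0, (size - block_size) // 2, size - block_size):
                f.seek(offset)
                h.update(f.read(block_size))
    return h.hexdigest()
```

This touches at most a few pages of the file regardless of its size, which is what makes it attractive over a NAS/SAN, at the cost of missing changes confined to the unread interior regions.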

jacobagilbert avatar Sep 29 '21 14:09 jacobagilbert

I think the primary disconnect here is that we originally specified the hash field really just for fingerprinting purposes, and that's not actually how people are using it. Nathan's use of it, for example, is really not a reliable fingerprint and more of a very coarse integrity check. I think all of these uses make sense, and the way we've currently spec'd the hash precludes some of them and forces people to go do something different in most cases (even if the thing they do care about is fingerprinting).

My personal opinion here is still that we do:

  • Mark core:hash as deprecated but still supported in v1.x (obviously, removing it would break backwards compatibility).
  • Proceed with the tuple approach we described (which we already introduced in Collections): "core:hash": ["hashname", "value"]

@Teque5 - especially interested in your thoughts re: the downsides of 👆👆
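For concreteness, verifying a hash stored in that tuple form could look like the sketch below. The metadata shape follows the comment above; the helper function itself is hypothetical, and it leans on hashlib.new() to dispatch on the algorithm name, which is one way the tuple keeps the spec algorithm-agnostic.

```python
import hashlib

def verify_hash_tuple(data: bytes, hash_tuple) -> bool:
    """Check data against a ["hashname", "value"] pair.
    Hypothetical helper illustrating the proposed tuple form."""
    name, expected = hash_tuple
    # hashlib.new() selects the algorithm by name, so the same code
    # handles "md5", "sha256", "sha512", etc.
    return hashlib.new(name, data).hexdigest() == expected

# Metadata fragment in the proposed form:
meta = {"core:hash": ["sha256", hashlib.sha256(b"samples").hexdigest()]}
```

Usage: verify_hash_tuple(b"samples", meta["core:hash"]) returns True for intact data and False if the bytes have changed.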

bhilburn avatar Sep 30 '21 01:09 bhilburn

I can easily address the need I have for fast file integrity checks in our deepsig.sigmf-ext.md (we can't use this proposal because we use the sha256 checksum for cases where comprehensive data integrity matters).

jacobagilbert avatar Sep 30 '21 01:09 jacobagilbert

In the name of simplicity I'd prefer not to add multiple fields to the spec to support additional checksum types if nobody is asking for it. Similarly, while NathanW's implementation works for that case, I'd rather not define a new type of checksum in the spec either. If there is a well-defined existing standard for doing that kind of file integrity check, then OK.

I'm not usually resource-constrained, so sha512 is fine. If I'm doing file integrity checks I use md5sum. I think for data version control and other tools md5 is also the de facto standard.

Teque5 avatar Sep 30 '21 04:09 Teque5

Sounds like the consensus is no action required in the core spec? Anyone opposed to closing this issue (@Teque5 @bhilburn)?

jacobagilbert avatar Oct 17 '21 02:10 jacobagilbert