
Checksum verification of downloaded granules

Open rupesh2 opened this issue 1 year ago • 16 comments

Checksums are available as a part of UMM-G records for some datasets (e.g., Daymet provides SHA-256; GHRSST provides MD5).

earthaccess.download() should verify the integrity of the downloaded granules against the checksum hashes, where available. This work will add such validations for downloaded files.
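A minimal sketch of what such verification might look like, assuming the checksum value and algorithm name have already been read from the granule's UMM-G record (function and argument names here are illustrative, not the earthaccess API):

```python
import hashlib

def verify_checksum(path: str, expected: str, algorithm: str = "SHA-256") -> bool:
    """Return True if the file at ``path`` hashes to ``expected``."""
    # UMM-G algorithm names like "SHA-256" map to hashlib names like "sha256"
    digest = hashlib.new(algorithm.replace("-", "").lower())
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks so large granules are not loaded into memory
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()
```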

rupesh2 avatar Feb 08 '24 19:02 rupesh2

When checksums are available, what do we think the behavior should be?

In my weak opinion, by default, earthaccess should verify and print a warning if verification fails. We can provide arguments to disable verification or to upgrade those warnings to errors.

mfisher87 avatar Feb 08 '24 19:02 mfisher87

Thanks @mfisher87 ! Printing a warning when the verification fails would be a good start.

rupesh2 avatar Feb 08 '24 19:02 rupesh2

Some examples of DAACs using checksums:

| Organization | Algorithm | Example granule | UMM-G checksum location |
| --- | --- | --- | --- |
| ORNLDAAC | SHA-256 | https://cmr.earthdata.nasa.gov/search/concepts/G2625060389-ORNL_CLOUD.umm_json | `"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"e15a43eb6914bf594833ff40d9c849adf08acdfa13b67e343308cceb5901b462","Algorithm":"SHA-256"}}}` |
| PODAAC | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2857127720-POCLOUD.umm_json | `"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"210130f6e8f61d7976f5405f9e925f98","Algorithm":"MD5"}}}` |
| ASF* | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2895561045-ASF.umm_json | `"AdditionalAttributes":[{"Name":"MD5SUM","Values":["764bf6dbe12eaf73f8e316924b409ded"]}]` |
| LAADS | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2895709317-LAADS.umm_json | `"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"27504ce476722f8c6f55551d9dc59455","Algorithm":"MD5"}}}` |
| LARC | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2829371222-LARC_CLOUD.umm_json | `"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"92df1ae596bf28bd0b966145ba76599b","Algorithm":"MD5"}}}` |
| LANCE | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2895741728-LANCEMODIS.umm_json | `"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"7987f2d56f15da34101dedc671715704","Algorithm":"MD5"}}}` |
| GHRC | – | | |
| GESDISC | – | | |
| LPDAAC | – | | |
| NSIDC | – | | |
| CDDIS | – | | |
| SEDAC | – | | |

*Checksums are not available for all datasets
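Based on the two layouts in the table above (most DAACs nest the checksum under `DataGranule`, while ASF publishes it as an `MD5SUM` additional attribute), extraction might look like this. This is a sketch against the excerpts shown here, not a complete treatment of the UMM-G schema:

```python
from typing import Optional, Tuple

def extract_checksum(umm: dict) -> Optional[Tuple[str, str]]:
    """Return (value, algorithm) from a UMM-G record dict, or None if absent."""
    info = umm.get("DataGranule", {}).get("ArchiveAndDistributionInformation")
    # ArchiveAndDistributionInformation may be a single object or a list of file entries
    entries = info if isinstance(info, list) else [info] if info else []
    for entry in entries:
        checksum = entry.get("Checksum")
        if checksum:
            return checksum["Value"], checksum["Algorithm"]
    # ASF-style records publish the checksum as an additional attribute instead
    for attr in umm.get("AdditionalAttributes", []):
        if attr.get("Name") == "MD5SUM" and attr.get("Values"):
            return attr["Values"][0], "MD5"
    return None
```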

rupesh2 avatar Mar 14 '24 18:03 rupesh2

Since checksums are not available for all datasets, I'm thinking we should print a warning when we try to verify and checksums aren't available? What do you think?

mfisher87 avatar Mar 15 '24 00:03 mfisher87

I was thinking:

  • If checksums are available, verify them upon download and print a warning message if the checksum does not validate.
  • If checksums are unavailable, perform no verification and print no warning. (Warning on every granule without a checksum would produce far too many messages.)
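The two bullets above could be sketched as a single post-download hook; the helper name and the `(value, algorithm)` pair shape are assumptions, not earthaccess API:

```python
import hashlib
import logging

logger = logging.getLogger("earthaccess")

def check_downloaded_file(path, checksum):
    """``checksum`` is a (value, algorithm) pair from UMM-G, or None if unpublished."""
    if checksum is None:
        # No checksum published: skip silently to avoid flooding users with warnings
        return
    value, algorithm = checksum
    digest = hashlib.new(algorithm.replace("-", "").lower())
    with open(path, "rb") as f:
        digest.update(f.read())
    if digest.hexdigest() != value.lower():
        # Checksum available but mismatched: warn, don't raise
        logger.warning("Checksum verification failed for %s (expected %s)", path, value)
```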

rupesh2 avatar Mar 15 '24 13:03 rupesh2

Good thinking, I like that too. We can also always make that behavior more configurable with feature flags going forward if users want to be able to customize it

mfisher87 avatar Mar 15 '24 15:03 mfisher87

cc @Sherwin-14

mfisher87 avatar May 21 '24 00:05 mfisher87

@mfisher87 I am thinking of implementing the solution discussed by @rupesh2. Do you have any specific opinions regarding this or should I proceed forward?

Sherwin-14 avatar May 22 '24 14:05 Sherwin-14

I think Rupesh's design sounds like a great path forward.

Next steps after that should probably be flags on earthaccess.download():

  • disable_checksum_validation: bool = False: Opt out of the validation
  • raise_on_checksum_validation_failure: bool = False: Opt in to raising an exception (instead of logging a warning) when validation fails, to enable programmatic handling by the user

(I'm sure someone can come up with better argument names than me :laughing:)

Perhaps these should be tackled as separate issues? No strong feelings here :)
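For concreteness, the two flags above might sit on the signature like this (a suggestion only; the positional parameters are simplified stand-ins for the real `earthaccess.download()` signature):

```python
def download(granules, local_path=None, *,
             disable_checksum_validation: bool = False,
             raise_on_checksum_validation_failure: bool = False):
    """Hypothetical signature sketch; body elided."""
    ...
```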

mfisher87 avatar May 22 '24 18:05 mfisher87

I'd prefer something a bit more unified, perhaps by using a single parameter that is not boolean, particularly since disable_checksum_validation=True means that raise_on_checksum_validation_failure has no meaning.

Perhaps a single parameter named validation that is an Enum: WARN (default), FAIL, or SKIP. I'm not sure I'm totally loving that, but that's the sort of direction I'd suggest.
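A sketch of that direction, showing how one enum subsumes both booleans (all names here, including the exception class, are hypothetical):

```python
import enum
import logging

logger = logging.getLogger("earthaccess")

class Validation(enum.Enum):
    WARN = "warn"  # default: log a warning on mismatch
    FAIL = "fail"  # raise, so callers can handle failures programmatically
    SKIP = "skip"  # do not verify at all

class ChecksumError(Exception):
    """Raised on checksum mismatch when the validation mode is FAIL."""

def handle_mismatch(path: str, mode: Validation) -> None:
    # Invoked when a downloaded file's digest differs from the UMM-G value
    if mode is Validation.FAIL:
        raise ChecksumError(f"Checksum mismatch for {path}")
    if mode is Validation.WARN:
        logger.warning("Checksum mismatch for %s", path)
    # Validation.SKIP falls through: nothing to do
```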

chuckwondo avatar May 22 '24 19:05 chuckwondo

That's a great point, I like your idea :)

mfisher87 avatar May 22 '24 20:05 mfisher87

Super interesting thread - it looks like we never published checksums in CMR at GES DISC, but we have records of that internally and validate using checksums during our migration to the cloud. Will bring it up internally.

briannapagan avatar Aug 13 '24 20:08 briannapagan

Amazing! Thanks, Brianna :)

mfisher87 avatar Aug 13 '24 22:08 mfisher87

> Super interesting thread - it looks like we never published checksums in CMR at GES DISC, but we have records of that internally and validate using checksums during our migration to the cloud. Will bring it up internally.

Brianna is correct. At GES DISC, for the granules we previously migrated to the cloud from on-prem, the checksum was not published to CMR for various reasons. However, for future ingests from another cloud data provider, the checksum will be published to CMR whenever the provider supplies it.

Now that earthaccess.download() may validate granules against checksums when available, we should consider adding checksums to our already-migrated granules so that earthaccess users can benefit from this feature when getting our data. This would require some effort though, and we will have some internal discussion on that...

hailiangzhang avatar Aug 14 '24 15:08 hailiangzhang

> This would require some effort though, and we will have some internal discussion on that...

Thanks so much for having this conversation! :bow:

mfisher87 avatar Aug 14 '24 15:08 mfisher87

Hi folks - I'm the TL of the Google Earth Engine Data team.

We mirror a lot of datasets. The most common problem we run into is missing assets/files. The second, much more rare one, is truncated files. Truncated files are easily fixed by making sure the jobs are doing atomic copies, but catching missing files can be hard when the dataset listings are massive and continuously updated.

To be honest, in 15 years I have never seen a download problem that would be caught only by verifying checksums. They have their value - e.g., we use them to verify data conversion - but we checksum the actual data bytes, not just files, because tiny changes in file formats would change file-level checksums.

So I'd recommend weighing the effort of maintaining and verifying checksums against the usefulness of such checks. I would be much more interested in more robust file listings (e.g., CMR is not easy to scan for huge datasets).

simonff avatar Aug 15 '24 19:08 simonff