
serializeToBagit ignores DataObjects that remotely reference data

Open mbjones opened this issue 5 years ago • 5 comments

A DataObject can include a dataURL that indicates that the bytes of the object are remotely stored on another server, rather than being either in memory or on the local filesystem (which are the other two options). When serializing a DataPackage to disk in BagIt format, the serializeToBagit function skips over any data objects that use the dataURL slot as the reference to data, thus breaking support for this serialization.

To fix, either:

  • during creation of the BagIt, download remote objects and serialize them like others
  • during creation of the BagIt, serialize remotely referenced data objects by reference in the fetch.txt file, preserving them as remote

The challenge with the second approach is that we still need checksums for the remote objects. Technically this should be in the SystemMetadata for the DataObject, but it's likely that it was not calculated. If the remote object is a DataONE object, then the SystemMetadata should have the needed checksum.
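To make the second option concrete, here is a minimal sketch of the lines a by-reference entry would need: one in fetch.txt (URL, length in octets or "-" if unknown, path in the bag) and one in the payload manifest (checksum, path). The `remote` data frame, both function names, and the example URL are hypothetical; none of this is existing datapack code.

```r
# Sketch only: format BagIt fetch.txt and payload-manifest lines for
# remotely referenced objects. All names here are hypothetical.
fetch_lines <- function(remote) {
  # fetch.txt line: URL, length in octets ("-" if unknown), path in the bag
  len <- ifelse(is.na(remote$size), "-",
                format(remote$size, scientific = FALSE, trim = TRUE))
  paste(remote$url, len, remote$path)
}

manifest_lines <- function(remote) {
  # Files listed in fetch.txt still need a manifest entry, so a missing
  # checksum is treated as an error rather than silently skipped.
  missing <- is.na(remote$checksum)
  if (any(missing)) {
    stop("no checksum for: ", paste(remote$path[missing], collapse = ", "))
  }
  paste(remote$checksum, remote$path)
}

# Placeholder values for illustration only.
remote <- data.frame(
  url = "https://example.org/object/abc",
  size = 1024,
  path = "data/abc.csv",
  checksum = "d41d8cd98f00b204e9800998ecf8427e",
  stringsAsFactors = FALSE
)
writeLines(fetch_lines(remote))
writeLines(manifest_lines(remote))
```

The `stop()` in `manifest_lines` is where the problem described above shows up: without a stored checksum in the right algorithm, the entry cannot be written.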

Relates to issue #3 and #119

mbjones avatar Aug 28 '20 23:08 mbjones

I don't see a way to reliably implement the second option. Currently 'datapack' is creating an MD5 payload manifest, which is required to include all files listed in fetch.txt. An MD5 checksum may not have been calculated and saved in the sysmeta for a remote object, for example SHA256 may have been saved. The remote file would have to be downloaded and the required checksum calculated.

BTW - should the checksum algorithm be updated to "SHA-256"?

Interestingly, here is the breakdown of DataONE checksum usage, with SHA-256 the most frequent:

https://cn.dataone.org/cn/v2/query/solr/?q=formatType:(DATA%20OR%20METADATA)&facet=true&facet.field=checksumAlgorithm&rows=0
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">112</int>
<lst name="params">
<str name="q">formatType:(DATA OR METADATA)</str>
<str name="facet.field">checksumAlgorithm</str>
<str name="rows">0</str>
<str name="facet">true</str>
</lst>
</lst>
<result name="response" numFound="2330070" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="checksumAlgorithm">
<int name="SHA256">257452</int>
<int name="SHA1">206933</int>
<int name="MD5">158680</int>
<int name="SHA-1">36847</int>
<int name="SHA-256">444</int>
</lst>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
<lst name="facet_intervals"/>
<lst name="facet_heatmaps"/>
</lst>
</response>
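
Note that the facet lists the same algorithm under two spellings (SHA256 and SHA-256, SHA1 and SHA-1). A quick sketch of merging them, using the counts from the response above; the `normalize_algorithm` helper is hypothetical, and dropping hyphens is just one assumed way to match the spellings:

```r
# Counts copied from the facet response above.
counts <- c("SHA256" = 257452, "SHA1" = 206933, "MD5" = 158680,
            "SHA-1" = 36847, "SHA-256" = 444)

# Hypothetical helper: treat "SHA256" and "SHA-256" as one algorithm
# by upper-casing the name and dropping hyphens.
normalize_algorithm <- function(x) gsub("-", "", toupper(x), fixed = TRUE)

# Sum the counts within each normalized name, largest first.
groups <- split(counts, normalize_algorithm(names(counts)))
totals <- sort(sapply(groups, sum), decreasing = TRUE)
print(totals)
```

Merged, the totals are 257896 for SHA-256, 243780 for SHA-1, and 158680 for MD5, so SHA-256 stays the most frequent.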

DataONE has the MNRead.getChecksum() service that will calculate any of the known checksum algorithms for a pid, but using that would put a dependency on the dataone package.

gothub avatar Sep 01 '20 19:09 gothub

This'd be nice to see. A few points:

  1. A bag can have multiple payload manifests, one for each of the checksum algorithms used. So it's valid to have a manifest-md5.txt and manifest-sha256.txt in the bag.

  2. A bit of a hack but the LoC checksums list has an entry for unk (Unknown). So maybe we could just use that as a third manifest file, manifest-unk.txt and put some bogus value in for the checksum value? The BagIt spec only specifies:

    the checksum algorithm SHOULD be registered in IANA's "Named Information Hash Algorithm Registry"

    so I think this might not make the bag invalid.

  3. The docs for dataURL say it's for lazy-loading of DataONE DataObjects and don't describe the use case at the top of this issue. Might be good to update that documentation to make it clear.

amoeba avatar Sep 01 '20 20:09 amoeba

@amoeba If we use multiple payload manifests, do we have to provide checksums for all objects in each manifest (e.g., for both MD5 and SHA-256)? Or can some objects be listed in MD5 and others in SHA-256 as long as each object is somewhere?

mbjones avatar Sep 01 '20 20:09 mbjones

Oh right. I misinterpreted what I read. Appears to be the former:

o Every payload manifest MUST list every payload file name exactly

:(
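
A rough sketch of what that requirement means in practice: every manifest has to cover the full payload, so splitting objects between an MD5 manifest and a SHA-256 manifest leaves gaps. The function and the file names below are hypothetical, not part of datapack.

```r
# Hypothetical check: for each payload manifest, report the payload
# files it fails to list.
missing_from_manifests <- function(payload_files, manifests) {
  # `manifests` is a named list: algorithm -> character vector of listed paths
  lapply(manifests, function(listed) setdiff(payload_files, listed))
}

payload <- c("data/a.csv", "data/b.csv")
manifests <- list(
  md5    = c("data/a.csv"),                # b.csv has no MD5 checksum
  sha256 = c("data/a.csv", "data/b.csv")
)
gaps <- missing_from_manifests(payload, manifests)
print(gaps)
```

Here `gaps$md5` contains `data/b.csv`, so a bag with these two manifests would not satisfy the rule quoted above.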

amoeba avatar Sep 01 '20 21:09 amoeba

As discussed at the dev meeting yesterday, an approach to resolve the checksum mismatch issue is to set a default checksum algorithm for a DataPackage. DataPackages can be downloaded and created using the following workflows, so the appropriate checksums need to be provided for BagIt serialization for each of these creation/composition methods:

  • downloading a package from DataONE with members that each have an existing checksum
  • downloading a package and updating with new package members
  • creating a new package composed of new package members
  • creating a new package composed of new and downloaded package members

These use cases can be fulfilled with the following changes:

  • for new DataPackages, specify the default checksum algorithm when the DataPackage object is created, which would override the default value:
dp <- new("DataPackage", checksumAlgorithm="SHA-256")
  • when new DataObjects are created, an algorithm could optionally be specified, which would override the default value:
do <- new("DataObject", ..., checksumAlgorithm="SHA-256")
  • when new DataObjects are added to a package, a check is made to ensure that the DataObject checksum algorithm matches the DataPackage checksum algorithm
  • when the DataPackage is serialized, the default DataPackage checksum algorithm is used to determine the algorithm to use for the checksum manifest.
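
The check on adding a member could look roughly like the following. This is a sketch with plain lists standing in for the DataPackage and DataObject classes; `add_member` is hypothetical, and treating "SHA256" and "SHA-256" as equal is an assumption.

```r
# Sketch only: plain lists stand in for DataPackage / DataObject.
# The field name `checksumAlgorithm` mirrors the proposal above;
# everything else is hypothetical.
add_member <- function(pkg, obj) {
  # Assumption: spellings such as "SHA256" and "SHA-256" are equivalent.
  norm <- function(a) gsub("-", "", toupper(a), fixed = TRUE)
  if (!identical(norm(pkg$checksumAlgorithm), norm(obj$checksumAlgorithm))) {
    stop(sprintf("object '%s' uses %s but the package expects %s",
                 obj$id, obj$checksumAlgorithm, pkg$checksumAlgorithm))
  }
  pkg$members[[obj$id]] <- obj
  pkg
}

pkg <- list(checksumAlgorithm = "SHA-256", members = list())
pkg <- add_member(pkg, list(id = "obj1", checksumAlgorithm = "SHA256"))
```

Whether a mismatch should be an error or should trigger recalculation of the checksum is a design choice left open here.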

Serializing downloaded packages is a bit more difficult, as a package might be composed of objects that do not all use the same algorithm. Therefore, I suggest that an algorithm be specified (or the default used) when downloading objects and packages. These changes would be made to the appropriate 'dataone' package functions:

  • when downloading an object, check the algorithm in the sysmeta against the requested (or default). If these differ, then a request is sent to the associated MN to calculate the required value for the correct algorithm:
do <- getDataObject(d1c, ..., checksumAlgorithm="SHA-256")

or for an entire DataPackage:

dp <- getDataPackage(d1c, ..., checksumAlgorithm="SHA-256")

Updating these methods in the 'dataone' package is necessary for lazy-loaded objects, where the data bytes are not present locally and the content may be too large to download just to calculate the checksum locally.

gothub avatar Sep 04 '20 21:09 gothub