
When querying a dataset for download, the server response should contain hashes of all files

Open hoechenberger opened this issue 4 years ago • 7 comments

Retrieving the JSON payload of an arbitrary dataset dataset_id from

https://openneuro.org/crn/datasets/{dataset_id}/download

provides an object with datasetId and files properties. The latter is an array of objects that provides information on the files in the dataset, specifically:

  • id, which is just the filename and the size in bytes, separated by a colon
  • filename, the filename
  • size, the size in bytes
  • urls, an array of URLs to retrieve the file from

Now, when downloading files from OpenNeuro, I would very much like to verify their integrity (I'd even say this is almost imperative 🙂).

Currently, the only available check is to ensure that the size of the downloaded file matches the size given in the metadata. This is not a reliable test, as corruption can leave the file size unchanged. I'd like to propose adding a new property, named hash or sha256, to the object describing each file. It would hold the file's hash (ideally SHA256), so one can properly verify file integrity after a download.
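To illustrate the proposal, here is a minimal Python sketch of what client-side verification could look like. It assumes the per-file metadata object carries a `sha256` property, which is only the field proposed in this issue and not something the endpoint returns today. The helper names `sha256_of_file` and `verify_download` are hypothetical.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, file_info):
    """Check a downloaded file against its metadata object.

    `file_info` is one entry of the `files` array. The `sha256` key is the
    property proposed in this issue (an assumption, not an existing field).
    """
    # Cheap check first: a size mismatch already proves corruption.
    if os.path.getsize(path) != file_info["size"]:
        return False
    # The size can match even when the content is damaged, so compare hashes too.
    return sha256_of_file(path) == file_info["sha256"].lower()
```

The size comparison is kept as a fast pre-check, but only the hash comparison detects corruption that leaves the size unchanged.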

hoechenberger, Dec 13 '20 09:12

This download REST API is deprecated, but in the spirit of this issue we should make sure the Draft and Snapshot GraphQL types include the correct hashes for git and annexed files. For git objects it does return the object hash, but I think that for annex objects we're using an encoded version of file path + size as an old optimization, which may need to be fixed to support a correct file content hash for all files.

nellh, Apr 15 '21 17:04

What is the recommended way to programmatically retrieve versioned objects (files, directories) from a dataset?

Also, will the openneuro CLI app continue to be supported?

hoechenberger, Apr 15 '21 17:04

> What is the recommended way to programmatically retrieve versioned objects (files, directories) from a dataset?

Direct git access, GitHub mirrors for snapshots, or the GraphQL API are all recommended ways of getting the same information.

Here's a GraphQL example to retrieve file information for one subject from a snapshot:

query {
  snapshot(datasetId: "ds000001", tag: "1.0.0") {
    files(prefix: "sub-01") {
      id
      filename
      size
      urls
    }
  }
}
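For anyone who wants to run this query from a script, here is a rough Python sketch using only the standard library. The endpoint URL `https://openneuro.org/crn/graphql` is an assumption on my part and should be checked against the OpenNeuro documentation. The function names are hypothetical, and the response handling relies only on the standard GraphQL response shape (`data` and `errors` keys).

```python
import json
import urllib.request

# Assumed endpoint; verify against the OpenNeuro documentation before relying on it.
GRAPHQL_URL = "https://openneuro.org/crn/graphql"

# The query from this thread, unchanged.
SNAPSHOT_FILES_QUERY = """
query {
  snapshot(datasetId: "ds000001", tag: "1.0.0") {
    files(prefix: "sub-01") {
      id
      filename
      size
      urls
    }
  }
}
"""


def build_request(query, url=GRAPHQL_URL):
    """Wrap a GraphQL query string in a JSON POST request."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )


def extract_files(response):
    """Pull the file list out of a decoded GraphQL response."""
    if response.get("errors"):
        raise RuntimeError(f"GraphQL errors: {response['errors']}")
    return response["data"]["snapshot"]["files"]


def fetch_snapshot_files(query=SNAPSHOT_FILES_QUERY):
    """Send the query over the network and return the list of file objects."""
    with urllib.request.urlopen(build_request(query)) as reply:
        return extract_files(json.load(reply))

# Usage (performs a network request):
#   for info in fetch_snapshot_files():
#       print(info["filename"], info["size"])
```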

> Also, will the openneuro CLI app continue to be supported?

Yes, it will be updated to use the GraphQL API.

nellh, Apr 15 '21 18:04

Great, thank you for the explanation, @nellh!

hoechenberger, Apr 15 '21 18:04

> Great, thank you for the explanation, @nellh!

No problem, please ask if there's documentation that would help. We're also aiming to provide more stability on the GraphQL side so it's less likely to suddenly disappear or change than the REST endpoints.

nellh, Apr 15 '21 18:04

Hello @nellh,

I've been looking at the GraphQL schema, and I'm not sure which field is supposed to contain the hash for git objects:

type DatasetFile {
  id: ID!
  key: String
  filename: String!
  size: BigInt
  annexed: Boolean
  urls: [String]
  objectpath: String
  directory: Boolean
}

Am I missing something?

Thanks,

Richard

hoechenberger, May 16 '21 16:05

Hello @ckrountree, I see you closed this issue. Has it been resolved? How can one retrieve the hashes via the GraphQL API?

hoechenberger, Oct 22 '21 21:10

@hoechenberger OpenNeuro provides a content hash for every file now.

id is a hash of the path and the content hash. This provides unique ids for files that appear at multiple paths within a dataset. key is the content hash (the git-annex key for annexed objects; for git objects we now provide the git hash).

For annexed objects, key is the git-annex key. This is a string like MD5E-s{size in bytes}--{content hash}.{extension} (example: SHA256E-s441399798--c476a833e3e13333bcee207cd5a624c1111eb927beea12234a1f82114438c9db.nii.gz). The hash can be either MD5 or SHA256 if the dataset was created on OpenNeuro. The prefix identifies which one is in use; see the git-annex documentation for all possible backends.
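As a sketch of how a client might use this, the Python below parses a key of the form described above and checks a downloaded file against it. It only handles the `{backend}-s{size}--{hash}[.{extension}]` layout with MD5 or SHA256 backends; git-annex keys can carry other fields and backends that this does not cover. The names `AnnexKey`, `parse_annex_key` and `verify_annexed_file` are hypothetical.

```python
import hashlib
import re
from typing import NamedTuple


class AnnexKey(NamedTuple):
    backend: str    # e.g. "SHA256E" or "MD5E"
    size: int       # file size in bytes
    digest: str     # hex content hash
    extension: str  # e.g. ".nii.gz", empty for backends without the "E" suffix


# Only the simple layout described in this thread is matched.
_KEY_RE = re.compile(r"^(?P<backend>[A-Z0-9_]+)-s(?P<size>\d+)--(?P<rest>.+)$")


def parse_annex_key(key: str) -> AnnexKey:
    match = _KEY_RE.match(key)
    if match is None:
        raise ValueError(f"unrecognized git-annex key: {key!r}")
    backend = match.group("backend")
    rest = match.group("rest")
    if backend.endswith("E"):
        # "E" backends append the file extension after the hex digest.
        digest, dot, ext = rest.partition(".")
        extension = dot + ext
    else:
        digest, extension = rest, ""
    return AnnexKey(backend, int(match.group("size")), digest, extension)


def verify_annexed_file(path, key: str) -> bool:
    """Compare a local file's size and content hash with its git-annex key."""
    parsed = parse_annex_key(key)
    algorithm = parsed.backend[:-1] if parsed.backend.endswith("E") else parsed.backend
    if algorithm not in ("MD5", "SHA256"):
        raise ValueError(f"backend not handled by this sketch: {parsed.backend}")
    digest = hashlib.new(algorithm.lower())
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == parsed.size and digest.hexdigest() == parsed.digest
```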

For git objects, key is the 40 character git object hash.
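A git object hash is not a plain hash of the file content: for a blob, git hashes a short header followed by the content. The sketch below shows how one could recompute it in Python for comparison with key. It assumes a repository using git's SHA-1 object format (consistent with the 40 character hash mentioned above) and that the downloaded bytes are exactly what git stored. The function name `git_blob_hash` is hypothetical.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Return the git object id of a blob holding `data` (SHA-1 object format).

    Git hashes the header "blob <size>\\0" followed by the raw content.
    """
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def verify_git_file(path, key: str) -> bool:
    """Compare a local file with the git object hash reported in `key`."""
    with open(path, "rb") as handle:
        return git_blob_hash(handle.read()) == key.lower()
```

Files tracked directly in git are typically small, so reading the whole file into memory is acceptable here.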

OpenNeuro 4.12.0 (upcoming release) has tree objects as well. These are identified with directory: true, and their key is always null.

nellh, Oct 04 '22 16:10