openneuro
When querying a dataset for download, the server response should contain hashes of all files
Retrieving the JSON payload of an arbitrary dataset dataset_id from https://openneuro.org/crn/datasets/{dataset_id}/download provides an object with datasetId and files properties. The latter is an array of objects that provides information on the files in the dataset, specifically:
- id, which is just the filename and the size in bytes, separated by a colon
- filename, the filename
- size, the size in bytes
- urls, an array of URLs to retrieve the file from
Now, when downloading files from OpenNeuro, I would very much like to verify their integrity (I'd even say this is almost imperative 🙂).
Currently, the only possible check is that the size of the downloaded file matches the size given in the metadata. That is not a reliable test, since a file can be corrupted without its size changing. I'd like to propose adding a new property to the objects describing each file, named hash or sha256, which holds the file's hash (ideally SHA256), so one can properly verify file integrity after a download.
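To make the request concrete, here is a minimal sketch of the client-side check this would enable. It assumes each entry of the files array carries a sha256 property with a hex digest; that property is the one proposed here and does not exist in the current response. The function names are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, file_info):
    """Check a downloaded file against one entry of the `files` array.

    `file_info["size"]` is available today. `file_info["sha256"]` is the
    HYPOTHETICAL property proposed in this issue.
    """
    if Path(path).stat().st_size != file_info["size"]:
        return False
    expected = file_info.get("sha256")
    if expected is None:
        # No hash in the metadata: the size check is all we can do.
        return True
    return sha256_of_file(path) == expected.lower()
```

Without the hash, verify_download can only fall back to the size comparison, which is exactly the weakness described above.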
This download REST API is deprecated, but in the spirit of this issue we should make sure the Draft and Snapshot GraphQL types include the correct hashes for git and annexed files. For git objects it does return the object hash, but I think we're using an encoded version of file path + size for annexed objects as an old optimization, which may need to be fixed to support a correct file content hash for all files.
What is the recommended way to programmatically retrieve versioned objects (files, directories) from a dataset?
Also, will the openneuro CLI app continue to be supported?
What is the recommended way to programmatically retrieve versioned objects (files, directories) from a dataset?
Direct git access, GitHub mirrors for snapshots, or the GraphQL API are all recommended ways of getting the same information.
Here's a GraphQL example to retrieve file information for one subject from a snapshot:
query {
  snapshot(datasetId: "ds000001", tag: "1.0.0") {
    files(prefix: "sub-01") {
      id
      filename
      size
      urls
    }
  }
}
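For anyone who wants to run this from a script, a rough Python sketch using only the standard library follows. The endpoint URL is an assumption (it is not stated in this thread), and build_query, build_request and fetch_files are illustrative names, not part of any OpenNeuro client.

```python
import json
import urllib.request

# ASSUMPTION: the GraphQL endpoint URL is not given in this thread.
GRAPHQL_URL = "https://openneuro.org/crn/graphql"


def build_query(dataset_id, tag, prefix):
    """Build the snapshot files query shown above for the given arguments."""
    # json.dumps yields a double-quoted, escaped string literal.
    return (
        "query { snapshot(datasetId: %s, tag: %s) { "
        "files(prefix: %s) { id filename size urls } } }"
        % (json.dumps(dataset_id), json.dumps(tag), json.dumps(prefix))
    )


def build_request(dataset_id, tag, prefix, url=GRAPHQL_URL):
    """Wrap the query in a JSON POST request, the usual GraphQL-over-HTTP form."""
    body = json.dumps({"query": build_query(dataset_id, tag, prefix)})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_files(dataset_id, tag, prefix, url=GRAPHQL_URL):
    """Send the request and return the list of file objects."""
    with urllib.request.urlopen(build_request(dataset_id, tag, prefix, url)) as resp:
        payload = json.load(resp)
    return payload["data"]["snapshot"]["files"]
```

Calling fetch_files("ds000001", "1.0.0", "sub-01") would then correspond to the query above, provided the assumed endpoint is correct.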
Also, will the openneuro CLI app continue to be supported?
Yes, it will be updated to use the GraphQL API.
Great, thank you for the explanation, @nellh!
Great, thank you for the explanation, @nellh!
No problem, please ask if there's documentation that would help. We're also aiming to provide more stability on the GraphQL side so it's less likely to suddenly disappear or change than the REST endpoints.
Hello @nellh,
I've been looking at the GraphQL schema and I'm not exactly sure which field is supposed to contain a hash for git objects?
type DatasetFile {
  id: ID!
  key: String
  filename: String!
  size: BigInt
  annexed: Boolean
  urls: [String]
  objectpath: String
  directory: Boolean
}
Am I missing something?
Thanks,
Richard
Hello @ckrountree, I see you closed this issue. Has this been resolved? How can one retrieve the hashes via the GraphQL API?
@hoechenberger OpenNeuro provides a content hash for every file now.
id is a hash of the path and the content hash. This provides unique ids for files that appear at multiple paths within a dataset. key is the content hash (the git-annex key for annexed objects; for git objects we now provide the git hash).
For annexed objects, key is the git-annex key. This is a string like MD5E-s{size in bytes}--{content hash}.{extension} (example: SHA256E-s441399798--c476a833e3e13333bcee207cd5a624c1111eb927beea12234a1f82114438c9db.nii.gz). The hash can be either MD5 or SHA256 if the dataset was created on OpenNeuro. The prefix identifies which one is in use; see the git-annex documentation for all possible backends.
For git objects, key is the 40 character git object hash.
OpenNeuro 4.12.0 (upcoming release) has tree objects as well. These are identified with directory: true, and their key is always null.
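As a rough illustration of how a client could use key for verification, here is a Python sketch. It handles only the MD5 and SHA256 git-annex backends mentioned above, and it treats a 40 character hex key as a SHA-1 git blob hash, which git computes over the header "blob {size}\0" followed by the file content. parse_key and content_matches_key are names made up for this example.

```python
import hashlib
import re

# {backend}-s{size in bytes}--{content hash}[.{extension}]
ANNEX_KEY = re.compile(
    r"^(?P<backend>[A-Z0-9]+)-s(?P<size>\d+)--(?P<digest>[0-9a-f]+)(?:\..+)?$"
)
# Only the two hash families mentioned in this thread are handled here.
BACKENDS = {"MD5": "md5", "MD5E": "md5", "SHA256": "sha256", "SHA256E": "sha256"}
GIT_SHA1 = re.compile(r"^[0-9a-f]{40}$")


def parse_key(key):
    """Return (hashlib algorithm, hex digest, size or None, is_git_object)."""
    match = ANNEX_KEY.match(key)
    if match and match["backend"] in BACKENDS:
        return BACKENDS[match["backend"]], match["digest"], int(match["size"]), False
    if GIT_SHA1.match(key):
        return "sha1", key, None, True
    raise ValueError(f"unsupported key: {key!r}")


def content_matches_key(data, key):
    """Check raw file bytes against a git-annex key or a git object hash."""
    algorithm, digest, size, is_git = parse_key(key)
    if size is not None and size != len(data):
        return False
    hasher = hashlib.new(algorithm)
    if is_git:
        # git hashes a blob as "blob <size>\0" followed by the content.
        hasher.update(b"blob %d\0" % len(data))
    hasher.update(data)
    return hasher.hexdigest() == digest
```

This version takes the whole file as bytes to stay short; for large imaging files one would feed the hasher in chunks instead. Tree objects (directory: true) have a null key and would need to be skipped before calling it.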