Support validation of git-annex content available in an S3 remote
For OpenNeuro to host datasets with remote annexed content, we should support a validation mode that can access (or skip) that remote content as needed.
Yes! Alternatively, or as a complement: could there be some generalization so that we could validate a "manifest" structure containing e.g. a list of filenames, some of them with inline content (.json or .tsv) and/or URLs for those files online, which could then be accessed via something like https://github.com/fsspec/ (very easy in Python)?
That would allow BIDS validation across archives where data might be on S3, at other HTTP URLs, or local, with everything going through the same interface.
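To make the manifest idea a bit more concrete, here is a minimal sketch using only the Python standard library. The `ManifestEntry` class, its field names, and `read_text` are all hypothetical, not an existing format or API; fsspec could stand in for the hand-written branches in `read_text`.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
from urllib.request import urlopen


@dataclass
class ManifestEntry:
    """One file in a hypothetical dataset manifest."""
    path: str                      # path relative to the dataset root
    content: Optional[str] = None  # inline text, e.g. small .json/.tsv files
    url: Optional[str] = None      # remote location (S3 or other HTTP URL)


def read_text(entry: ManifestEntry, root: Optional[Path] = None) -> str:
    """Return the file's text through one interface, whatever backs it."""
    if entry.content is not None:
        return entry.content
    if root is not None and (root / entry.path).is_file():
        return (root / entry.path).read_text(encoding="utf-8")
    if entry.url is not None:
        # Network access happens only when no inline or local copy exists.
        with urlopen(entry.url) as response:
            return response.read().decode("utf-8")
    raise FileNotFoundError(entry.path)


# Example manifest; the names and the example.org URL are placeholders.
manifest = [
    ManifestEntry("dataset_description.json",
                  content='{"Name": "Example", "BIDSVersion": "1.8.0"}'),
    ManifestEntry("sub-01/anat/sub-01_T1w.nii.gz",
                  url="https://example.org/ds/sub-01/anat/sub-01_T1w.nii.gz"),
]
description = json.loads(read_text(manifest[0]))
print(description["Name"])
```

A validator could walk the `path` values for filename checks and call `read_text` only for the sidecar files it needs to inspect.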
Some notes I was making with @rwblair:
remote.log
  <uuid> [<key>=<value>]...
  keys of interest:
    name
    type (S3)
    publicurl
    timestamp
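A rough sketch of reading that format in Python. The function name, the sample UUID, and the bucket URL are made up, and two things are assumptions: that values contain no spaces, and that timestamps look like seconds with a trailing `s`, so that the most recent line for a UUID wins.

```python
from typing import Dict


def _timestamp(config: Dict[str, str]) -> float:
    # Assumption: timestamps look like "1581042315.5s".
    try:
        return float(config.get("timestamp", "0").rstrip("s"))
    except ValueError:
        return 0.0


def parse_remote_log(text: str) -> Dict[str, Dict[str, str]]:
    """Map remote UUID -> {key: value} from the text of remote.log."""
    remotes: Dict[str, Dict[str, str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        uuid, *pairs = line.split()
        config = {}
        for pair in pairs:
            key, sep, value = pair.partition("=")  # split on the first "="
            if sep:
                config[key] = value
        # If a UUID appears more than once, keep the newest entry.
        if uuid not in remotes or _timestamp(config) >= _timestamp(remotes[uuid]):
            remotes[uuid] = config
    return remotes


# Made-up sample line in the format above.
sample = (
    "11111111-2222-3333-4444-555555555555 name=s3-PUBLIC type=S3 "
    "publicurl=https://s3.amazonaws.com/example-bucket/ "
    "timestamp=1581042315.5s\n"
)
remotes = parse_remote_log(sample)
s3_remotes = {u: c for u, c in remotes.items() if c.get("type") == "S3"}
```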
*.rmet
  Path: {md5(key)[0:3]}/{md5(key)[3:6]}/{key}.log.rmet
  Contents:
    <timestamp> <uuid>:V +<version>#<path>
  Key: <hashname>-s<size>--<hash>.<ext>
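The path, line, and key formats above could be handled roughly like this in Python. The function names and regular expressions are illustrative guesses that follow the notes literally (for example, treating everything after the first `.` in the key as the extension), and the sample key and version string are made up.

```python
import hashlib
import re
from typing import Dict, Optional


def rmet_path(key: str) -> str:
    """Path of a key's .log.rmet file on the git-annex branch."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[0:3]}/{digest[3:6]}/{key}.log.rmet"


# <timestamp> <uuid>:V +<version>#<path>
RMET_LINE = re.compile(
    r"^(?P<timestamp>\S+) (?P<uuid>[^:\s]+):V "
    r"\+(?P<version>[^#]+)#(?P<path>.+)$"
)

# <hashname>-s<size>--<hash>.<ext>
KEY = re.compile(
    r"^(?P<hashname>[^-]+)-s(?P<size>\d+)--(?P<hash>[^.]+)(?P<ext>\..+)?$"
)


def parse_rmet_line(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of one rmet line, or None if it does not match."""
    match = RMET_LINE.match(line.strip())
    return match.groupdict() if match else None


def parse_key(key: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a key into hashname, size, hash, and ext (ext may be None)."""
    match = KEY.match(key)
    return match.groupdict() if match else None


# Made-up examples in the formats above.
key = "SHA256E-s1024--" + "0" * 64 + ".nii.gz"
line = ("1581042315.5s 11111111-2222-3333-4444-555555555555:V "
        "+abc123#sub-01/anat/sub-01_T1w.nii.gz")
entry = parse_rmet_line(line)
fields = parse_key(key)
```

The size in the key is handy on its own: it lets a validator report file sizes without fetching any remote content.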
Logic
stat file
if found:
  use local opener
if not:
  readlink -> (../)*.git/annex/objects/*/*/{key}/{key}
  determine git root
  load remotes by UUID (git-annex:remote.log)
  read rmet (git-annex:{md5(key)[0:3]}/{md5(key)[3:6]}/{key}.log.rmet)
  construct URL
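The steps above might look something like this in Python. All function names are hypothetical. Two assumptions are baked in: the git-annex branch is available locally under the name `git-annex`, and the remote is a versioned S3 bucket, so an object version is selected with the `versionId` query parameter appended to `publicurl` plus the path recorded in the rmet file.

```python
import os
import subprocess
from pathlib import Path
from typing import Optional, Tuple
from urllib.parse import quote


def annex_key(path: Path) -> Optional[str]:
    """Return the key if `path` is a symlink into .git/annex/objects."""
    if not path.is_symlink():
        return None
    target = os.readlink(path).replace(os.sep, "/")
    if "/.git/annex/objects/" not in "/" + target:
        return None
    return target.rsplit("/", 1)[-1]  # last component is the key


def git_root(path: Path) -> Optional[Path]:
    """Walk up from `path` until a directory containing .git is found."""
    for parent in path.absolute().parents:
        if (parent / ".git").exists():
            return parent
    return None


def read_annex_branch(root: Path, name: str) -> str:
    """Read a file such as remote.log from the git-annex branch."""
    result = subprocess.run(
        ["git", "-C", str(root), "show", f"git-annex:{name}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def construct_url(publicurl: str, remote_path: str, version: str) -> str:
    """Assumed URL layout: <publicurl>/<path>?versionId=<version>."""
    base = publicurl.rstrip("/")
    return f"{base}/{quote(remote_path)}?versionId={quote(version, safe='')}"


def resolve(path: Path) -> Tuple[str, str]:
    """Return ("local", path) if content is present, else ("annex", key)."""
    # exists() follows symlinks, so a dangling annex link counts as missing.
    if path.exists():
        return ("local", str(path))
    key = annex_key(path)
    if key is None:
        raise FileNotFoundError(path)
    return ("annex", key)
```

For an `("annex", key)` result, the caller would then use `git_root` and `read_annex_branch` to load `remote.log` and the key's `.log.rmet` file, pick the S3 remote by UUID, and pass its `publicurl` together with the recorded path and version to `construct_url`.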
Linking in https://github.com/bids-standard/bids-validator/pull/280, which was a refactor that laid some groundwork for this effort. Next step is to start pulling pieces in from OpenNeuro's isomorphic git-based implementation.