openneuro
Recent datasets in OpenNeuroDatasets aren't true datalad datasets
It looks like all OpenNeuro datasets added to the datasets repo on/after 21 October 2021 are not complete datalad datasets, whereas those added before then are. The non-datalad datasets have a git-annex branch, but no .datalad directory in the master/main branch, and specifically no .datalad/config file that specifies the datalad dataset id. See e.g. https://github.com/OpenNeuroDatasets/ds003778.
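For anyone wanting to check a local clone, here is a minimal sketch that looks for the dataset id in .datalad/config. The helper name `read_datalad_id` is hypothetical (it is not part of datalad), and the parser only handles the simple git-config layout that appears in this thread, so treat it as an illustration rather than a complete config reader.

```python
import re
from pathlib import Path
from typing import Optional

# Matches a git-config style section header such as: [datalad "dataset"]
_SECTION = re.compile(r'^\[\s*([\w.-]+)(?:\s+"([^"]*)")?\s*\]$')


def read_datalad_id(dataset: str) -> Optional[str]:
    """Return the id stored in .datalad/config, or None if it is absent.

    Hypothetical helper: assumes the id lives under [datalad "dataset"].
    """
    config = Path(dataset) / ".datalad" / "config"
    if not config.is_file():
        return None
    in_dataset_section = False
    for raw in config.read_text().splitlines():
        line = raw.strip()
        # Skip blank lines and git-config comments.
        if not line or line.startswith(("#", ";")):
            continue
        header = _SECTION.match(line)
        if header:
            in_dataset_section = header.groups() == ("datalad", "dataset")
            continue
        if in_dataset_section and "=" in line:
            key, value = line.split("=", 1)
            if key.strip().lower() == "id":
                return value.strip()
    return None
```

A return value of None for a cloned dataset would indicate one of the affected datasets described above.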
Operating on these datasets with datalad (cloning, getting data) is still possible, but other operations are not, such as metadata handling with datalad-metalad, for which the dataset id is required. It is possible to turn these datasets into true datalad datasets on the user side (i.e. after cloning from OpenNeuroDatasets), but then there wouldn't be a consistent and globally unique id for the dataset unless the result of that forced create is committed and merged back into the master/main branch of the affected datasets.
Is it a specific decision to not turn new OpenNeuro datasets into true datalad datasets? And if so, is it possible to change that decision for future datasets? And/or to incorporate "fixes" for existing affected datasets?
I don't believe this was a specific decision, but was probably accidentally dropped in a refactor (https://github.com/OpenNeuroOrg/openneuro/pull/2286) that stopped using datalad internally.
I think the fix would just be to add the following to the init process, as well as just before snapshots to pick back up datasets that were missing this:
import uuid
from pathlib import Path

def ensure_datalad_uuid(dataset):
    # Write .datalad/config with a fresh dataset id only if the file is missing.
    config = Path(dataset) / ".datalad" / "config"
    if not config.exists():
        config.parent.mkdir(exist_ok=True)
        config.write_text(f'[datalad "dataset"]\n\tid = {uuid.uuid4()}\n')
Is there anything else missing? Is UUID4 correct?
I've implemented it for new datasets in #2626. I'll also add it for new snapshots in the case where it doesn't exist, since that's useful for these datasets where it is missing and any direct git uploads that don't have it.
@effigies @nellh thanks for looking into this so speedily. AFAICT: yes, datalad uses UUID4, and yes, adding the id to a .datalad/config file, e.g.
$ cat .datalad/config
[datalad "dataset"]
id = deabeb9b-7a37-4062-a1e0-8fcef7909609
should make the existing datasets compliant with what datalad expects a datalad dataset to look like.
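As a quick sanity check on the UUID4 question, the standard-library `uuid` module can report the version of an id string. This is only a sketch: `is_uuid4` is a hypothetical helper name, and whether datalad itself validates the id this way is not something established in this thread.

```python
import uuid


def is_uuid4(value: str) -> bool:
    """Return True if value parses as an RFC 4122 version 4 UUID."""
    try:
        return uuid.UUID(value).version == 4
    except ValueError:
        # Raised by uuid.UUID for strings that are not well-formed UUIDs.
        return False


print(is_uuid4("deabeb9b-7a37-4062-a1e0-8fcef7909609"))
```

Note that `uuid.UUID` is lenient about surface form (it also accepts ids without hyphens or wrapped in braces), so this checks the version field rather than the exact string layout.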