Recent datasets in OpenNeuroDatasets aren't true datalad datasets

Open jsheunis opened this issue 2 years ago • 3 comments

It looks like all OpenNeuro datasets added to the datasets repo on or after 21 October 2021 are not complete datalad datasets, whereas those added before then are. The non-datalad datasets have a git-annex branch, but no .datalad directory in the master/main branch, and specifically no .datalad/config file that specifies the datalad dataset id. See e.g. https://github.com/OpenNeuroDatasets/ds003778.

Operating on these datasets with datalad (cloning, getting data) is still possible, but other operations aren't, such as metadata handling with datalad-metalad, for which the dataset id is required. It is possible to turn these datasets into true datalad datasets on the user side (i.e. after cloning from OpenNeuroDatasets), but then there wouldn't be a consistent and globally unique id for the dataset unless that force create is committed and merged back into the master/main branch of the affected datasets.
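
Whether a given clone is affected can be checked by looking for the `id` key in `.datalad/config`. The following is a minimal sketch, not datalad's own API: the helper name `read_dataset_id` is hypothetical, and it uses a simple line-based parse that assumes the file follows the layout shown in this thread rather than the full git-config grammar.

```python
from pathlib import Path
from typing import Optional


def read_dataset_id(dataset: str) -> Optional[str]:
    """Return the id stored under [datalad "dataset"] in .datalad/config, or None.

    Hypothetical helper: handles only the simple layout discussed here,
    not every construct that git-config allows.
    """
    config = Path(dataset) / ".datalad" / "config"
    if not config.is_file():
        return None
    in_section = False
    for raw in config.read_text().splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Track whether we are inside the [datalad "dataset"] section.
            in_section = line == '[datalad "dataset"]'
        elif in_section and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "id":
                return value.strip() or None
    return None
```

A return value of `None` for a fresh clone would suggest the dataset is one of the affected ones.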

Is it a specific decision to not turn new OpenNeuro datasets into true datalad datasets? And if so, is it possible to change that decision for future datasets? And/or to incorporate "fixes" for existing affected datasets?

jsheunis avatar Jul 03 '22 06:07 jsheunis

I don't believe this was a specific decision, but was probably accidentally dropped in a refactor (https://github.com/OpenNeuroOrg/openneuro/pull/2286) that stopped using datalad internally.

I think the fix would just be to add the following to the init process, as well as just before snapshots to pick back up datasets that were missing this:

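import uuid
from pathlib import Path

def ensure_datalad_uuid(dataset):
    """Write a datalad dataset id to .datalad/config if the file is missing."""
    config = Path(dataset) / ".datalad" / "config"
    if not config.exists():
        config.parent.mkdir(exist_ok=True)
        # git-config format: section header, then a tab-indented key.
        config.write_text(f'[datalad "dataset"]\n\tid = {uuid.uuid4()}\n')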

Is there anything else missing? Is UUID4 correct?

effigies avatar Jul 05 '22 19:07 effigies

I've implemented it for new datasets in #2626. I'll also add it for new snapshots in the case where it doesn't exist, since that's useful for these datasets where it is missing and any direct git uploads that don't have it.

nellh avatar Jul 05 '22 19:07 nellh

@effigies @nellh thanks for looking into this so speedily. AFAICT: yes, datalad uses UUID4, and yes adding the id to a .datalad/config file, e.g.

$ cat .datalad/config
[datalad "dataset"]
	id = deabeb9b-7a37-4062-a1e0-8fcef7909609

should make the existing datasets compliant with what datalad expects a datalad dataset to look like.
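
As a quick sanity check on the id format discussed above, Python's standard `uuid` module can report which UUID version a string encodes. This is only a sketch: the function name `is_uuid4` is hypothetical, and it checks the format of the id only, not whether datalad itself would accept the dataset.

```python
import uuid


def is_uuid4(value: str) -> bool:
    """Return True if value parses as an RFC 4122 version-4 UUID."""
    try:
        return uuid.UUID(value).version == 4
    except ValueError:
        # Raised for strings that are not well-formed UUIDs.
        return False


# The example id from the config shown above.
print(is_uuid4("deabeb9b-7a37-4062-a1e0-8fcef7909609"))  # prints True
```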

jsheunis avatar Jul 10 '22 13:07 jsheunis