
Add dataset fetching module

Open tsalo opened this issue 4 years ago • 14 comments

Summary

Per today's call, we want to add a new datasets module (similar to nilearn.datasets) for fetching open datasets.

Additional Detail

We'll want to focus on minimally preprocessed open datasets, which we will upload to the OSF.

Some things to keep an eye on:

  • Managing the "tedana data" folder. I adapted nilearn's data-management code for NiMARE, and the main issue is that nilearn's code is fairly complicated.
  • Ensuring standard interfaces across datasets. This shouldn't be a major problem if we stick to BIDS-ish datasets.
  • Maintaining good coverage may be difficult.

Next Steps

  1. Add preprocessed datasets to fetchable location (e.g., OSF).
  2. Add functions to new tedana.datasets module to grab those datasets.
  3. Tests.
  4. Documentation.

tsalo avatar Feb 19 '21 16:02 tsalo

How would everyone feel about using GIN instead of (or in addition to) the OSF?

tsalo avatar Feb 21 '21 18:02 tsalo

How would everyone feel about using GIN instead of (or in addition to) the OSF?

I don't know the GIN API very well. Have you used it successfully in the past to fetch data (outside of, e.g., datalad)?

emdupre avatar Feb 22 '21 02:02 emdupre

I... have not. I'm planning to use it for some phantom data soonish, but I haven't tried it out yet.

tsalo avatar Feb 22 '21 17:02 tsalo

Here's the OSF repository for the Cambridge multi-echo data.

Some notes:

  • I don't know how nilearn.datasets figures out the download URL, but I've included a manifest.json which pairs each file with its download link.
    • It should be possible to just load the manifest file as a dictionary, then filter whichever files are desired and download them. Though I haven't actually tried this to see if it works.
  • The minimally preprocessed BOLD series contain desc-partialPreproc in their file names.
  • The brain mask in scanner space is also included.
  • The transforms needed to go from scanner space to MNI152 space are included, in case that is of interest.
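The load-filter-download flow described above might look roughly like this. It's an untested sketch: it assumes the manifest is a flat JSON object mapping file names to OSF download URLs, and the function name and signature are illustrative only.

```python
import json
import os
import urllib.request


def fetch_from_manifest(manifest_file, keyword, out_dir):
    """Download every manifest entry whose file name contains ``keyword``.

    Sketch only: assumes ``manifest_file`` is a JSON object mapping
    relative file names to their download URLs, as in the OSF
    repository's manifest.json.
    """
    with open(manifest_file) as fo:
        manifest = json.load(fo)

    os.makedirs(out_dir, exist_ok=True)
    downloaded = []
    for fname, url in manifest.items():
        # Filter on the file name, e.g. keyword="desc-partialPreproc".
        if keyword not in fname:
            continue
        out_file = os.path.join(out_dir, os.path.basename(fname))
        urllib.request.urlretrieve(url, out_file)
        downloaded.append(out_file)
    return downloaded
```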

notZaki avatar Feb 22 '21 18:02 notZaki

I like the idea of writing a fetching module that can filter out what we'd like from the JSON, but that could get a bit complicated depending on what we're trying to fetch. I certainly think it's worthwhile and practical to read in a JSON that has file-URL pairings and then download them into a specified directory though.

jbteves avatar Feb 24 '21 22:02 jbteves

Are there any other minimally-preprocessed datasets we can include in this effort?

tsalo avatar Mar 22 '21 18:03 tsalo

There is this one as well. I can run it through the same kind of pre-processing, but maybe @emdupre already has a minimally pre-processed version?

notZaki avatar Mar 22 '21 19:03 notZaki

That's a good point. We could target the available datasets on OpenNeuro (which would include the Dalenberg and Cohen datasets as well). I do have to run fMRIPrep on all four OpenNeuro datasets for a replication in the near future, but if anyone has the opportunity to do it before then, that would be amazing.

tsalo avatar Mar 22 '21 19:03 tsalo

The more I think about it, the more convinced I am that we should capture the full fMRIPrep derivative folder structure, so that users can grab the files they need (e.g., warps to standard space, brain masks, etc.). I think we could either use a series of keywords to filter a manifest of files (as I do in NiMARE's _fetch_database()) or we could have a series of "derivative group" keywords that would select common sets of files.

In terms of housing the derivatives, I've found that it's very annoying to manage folder systems on the OSF, although I've only done it manually in the past. One option would be to use a compatible storage solution (e.g., Google Drive) that can be integrated as an add-on to an OSF repository. This would provide nice OSF URLs for the files. Alternatively, we could store the data on G-Node GIN. @emdupre took a look and couldn't find an API, so we'd either have to add a dependency (☹️) or build URLs to fetch from based on a manifest stored in tedana. I do the latter to identify relevant NeuroQuery and Neurosynth files stored on GitHub, which has a similar URL structure to GIN.

tsalo avatar Oct 25 '21 15:10 tsalo

Hi everyone! Have there been any more steps towards collecting sources for processed derivatives of open multi-echo datasets? I know of the Cambridge one that @notZaki linked to above, but haven't come across any others yet. I'm building a datalad dataset to include raw and derived multi-echo datasets (see https://github.com/ME-ICA/multi-echo-data-analysis/issues/13) and would love to make it as comprehensive as possible.

jsheunis avatar Mar 15 '22 19:03 jsheunis

I'm in the process of fMRIPrepping Le Petit Prince for #785, and I've already done "Multi-echo masking test dataset" for #783.

Both Le Petit Prince and the DuPre dataset require a bit of fixing before I could preprocess them, though, so I wonder if we should use the updated or original raw datasets?

tsalo avatar Mar 15 '22 21:03 tsalo

I'm in the process of fMRIPrepping Le Petit Prince for https://github.com/ME-ICA/tedana/issues/785, and I've already done "Multi-echo masking test dataset" for https://github.com/ME-ICA/tedana/issues/783.

Great, thanks!

Both Le Petit Prince and the DuPre dataset require a bit of fixing before I could preprocess them, though, so I wonder if we should use the updated or original raw datasets?

Are the updates available on OpenNeuro, or did you only adapt them locally? In the former case, I probably have the latest version already available in the datalad dataset. In the latter case, are you able to push the updates to OpenNeuro?

jsheunis avatar Mar 15 '22 22:03 jsheunis

I had to edit both of those datasets locally. Perhaps @emdupre can push the necessary changes to her dataset, but none of us has write permissions for Le Petit Prince.

@emdupre, I believe that the only problem with your dataset is that the T1w volumes need to be squeezed from 4D to 3D. Also, sub-16 only has partial physio data, but I didn't see a note about that in the README.

tsalo avatar Mar 16 '22 18:03 tsalo

@emdupre, I believe that the only problem with your dataset is that the T1w volumes need to be squeezed from 4D to 3D. Also, sub-16 only has partial physio data, but I didn't see a note about that in the README.

I'll check if my access is current and get this updated! Thanks for the edits :pray:

emdupre avatar Mar 16 '22 21:03 emdupre