bids-specification icon indicating copy to clipboard operation
bids-specification copied to clipboard

ENH: Proposal for `model-<modelname>` directory

Open arokem opened this issue 3 years ago • 31 comments

As an auxilliary to work being done on BEP16 (DWI derivatives), we are proposing a change to the common derivatives specification to allow the creation of a sub-directory into which model parameters will be saved, together with a models.tsv metadata file to describe all the different models that are fit within a pipeline.

EDIT BY @oesteban We arrived to this proposal as a way of partially address the issues identified by @Lestropie (see bids-standard/bids-bep016#50), building from his proposal n. 3.

arokem avatar Sep 12 '22 20:09 arokem

Codecov Report

Patch and project coverage have no change.

Comparison is base (8bc376e) 87.90% compared to head (104abd5) 87.90%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1280   +/-   ##
=======================================
  Coverage   87.90%   87.90%           
=======================================
  Files          14       14           
  Lines        1273     1273           
=======================================
  Hits         1119     1119           
  Misses        154      154           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov[bot] avatar Sep 12 '22 20:09 codecov[bot]

See https://github.com/bids-standard/bids-bep016/issues/50 for relevant context.

Specifically https://github.com/bids-standard/bids-bep016/issues/50#issuecomment-1244011022

arokem avatar Sep 12 '22 20:09 arokem

I haven't really integrated this whole proposal yet, but I want to jot down some quick notes about the direction we took in FitLins, which is at least superficially quite different. Not sure if it can apply here at all...

In FitLins, we have hierarchical models (e.g., run-level -> subject-level -> dataset-level), such that you could potentially have multiple dataset-level models that build off the results in a common subject-level model. Instead of nesting models more deeply inside modalities, I've been pulling them to the top:

fitlins_derivatives/
    dataset_description.json
    node-run/
        sub-01/
        sub-02/
        ...
    node-subject/
        sub-01/
        sub-02/
        ...
    node-group/
        group_results

I imagined the node-* becoming derivative datasets in their own right, and then fitlins_derivatives becoming a super-dataset (a concept still to be defined). This would give the flexibility to, say, distribute first-level results as a standalone dataset for image-based meta-analysis.

effigies avatar Sep 12 '22 20:09 effigies

@arokem If you want to fix up the markdown, use npm install remark; npx remark <markdown-file> -o.

effigies avatar Sep 12 '22 20:09 effigies

Thanks for the quick comments here @effigies! Are these proposals really in conflict? I can imagine having a model-<modelname> directory within every one of the node- levels that you specified. Seems complementary to me.

arokem avatar Sep 12 '22 20:09 arokem

@arokem If you want to fix up the markdown, use npm install remark; npx remark -o.

Thanks! Unfortunately, working in codespaces because UT's wifi won't let me pull stuff down from GitHub :-/

I'll let @oesteban do the rest of these fixes :-)

arokem avatar Sep 12 '22 20:09 arokem

@effigies

$ npm install remark; npx remark src/05-derivatives/01-introduction.md -o src/05-derivatives/01-introduction.md
npm WARN saveError ENOENT: no such file or directory, open '/Users/oesteban/workspace/bids-specification/package.json'
npm WARN enoent ENOENT: no such file or directory, open '/Users/oesteban/workspace/bids-specification/package.json'
npm WARN bids-specification No description
npm WARN bids-specification No repository field.
npm WARN bids-specification No README data
npm WARN bids-specification No license field.

+ [email protected]
updated 1 package and audited 94 packages in 0.764s

42 packages are looking for funding
  run `npm fund` for details

found 0 vulnerabilities

npx: installed 55 in 2.831s
command not found: remark

oesteban avatar Sep 12 '22 20:09 oesteban

the npx remark command doesn't seem to work

oesteban avatar Sep 12 '22 20:09 oesteban

I ran it. Apparently it doesn't fix everything, though...

effigies avatar Sep 12 '22 20:09 effigies

I was wondering how this fits with the computational model BEP (if it's related/needs to)?

PeerHerholz avatar Sep 12 '22 21:09 PeerHerholz

I was wondering how this fits with the computational model BEP (if it's related/needs to)?

As per that BEP: BIDS Computational Model files store the mathematical and computational descriptions of computational models as well as simulation results. Based on that I'd say the only overlap would be the simulation results bit, and not entirely because this PR shies away from specifying models themselves. I believe that BEP has a very large overlap with the BIDS-statsmodel.

Instead, we want this PR to gauge the interest/position of the BIDS community in the possibility that we add the new model-<model_id>/ level to the directory hierarchy of derivatives.

There are some deeply related aspects of this PR that we intendedly left for the future:

  • Multi-datatype models (e.g., a model that generates derivatives from both func/ and dwi/ resources). We were entertaining ideas like allowing a folder func+dwi/ to exist, or have a more general multi/ possibility. In any case, we did not want to get into that issue here, but will come up almost immediately after this one is decided.
  • The actual specification of models. There's a pretty much developed proposal, very comprehensive, in BEP016 -- thanks to the persistence, attention to detail, and overall titanic effort of @Lestropie.

Those are critical elements that affect this PR, so it's good to have them in mind (and it's a likely possibility that a majority of people push towards having one or both issues covered here too, should it be going towards an accepted decision).

It'd be great to have critical reviews from @rwblair and @tsalo, and anyone else involved in the BIDS schema feat. Does this complicate things on that front?

Historically, this PR comes from bids-standard/bids-bep016#50. It does not itself fully resolve the inheritance issue described there, but it basically evolves from Option 3 there, which indirectly resolves the problem to the extent it is necessary within BEP016.

oesteban avatar Sep 13 '22 10:09 oesteban

@arokem @Lestropie I have allowed myself to remove the WIP label from the title. I have made some edits to use macros and make the text more consistent with the rest of the spec.

For a render of this PR at this point, go to https://output.circle-artifacts.com/output/job/ecd7481c-1064-41ad-a51b-e362395503e4/artifacts/0/dev_docs/01-introduction.html

oesteban avatar Sep 13 '22 15:09 oesteban

Dear all, by chance I stumbled on this BEP, and as I'm working on various BIDS-related software, I wanted to put my 5 cents.

My concern is sub-folders like dwi/model-DTI may confuse BIDS tools (and make difficult to develop new) as it deviates from standard BIDS naming convention (sub-<>_<entities>_<suffix>.<extension>). From implementation point of view we will need to add a lot of tests into dataset parcing in order to index such files.

I would propose to put models into separate data-type folder together with the sidecar json and summury tsv file:

sub-abc/ses-def/models/sub-abc_ses-def_mod-DTI_model.d
sub-abc/ses-def/models/sub-abc_ses-def_mod-DTI_model.json
sub-abc/ses-def/models/sub-abc_ses-def_models.tsv
sub-abc/ses-def/models/sub-abc_ses-def_models.json

This way the model data is introduced exactly the same way as any other data file in the dataset, simplifying parsing. The d extension is here just to point it's a folder. With carefull selection of entities (for ex for parameter adjustments) you can see which folder correspond to what model:

sub-abc_mod-DTI_par1-00_par2-00_model.d
sub-abc_mod-DTI_par1-00_par2-01_model.d
sub-abc_mod-DTI_par1-00_par2-02_model.d

With convention that 00 means nominal/default value, and non-zero index the corresponding parameter adjustment. Or alternatively using one entity for different parameter sets with meaningful naming.

For the group level models models folder can be moved into root directory:

root/
    models/acq-DTI_model.d
    sub-abc/

Cheers, Nikita

nbeliy avatar Sep 14 '22 10:09 nbeliy

My concern is sub-folders like dwi/model-DTI may confuse BIDS tools (and make difficult to develop new) as it deviates from standard BIDS naming convention (sub-<>_<entities>_<suffix>.<extension>).

I don't think there's anything in this PR that is not supported by the current specifications. It's true that the BIDS Validator may require some extra heuristic, but that's more a problem of being able to formalize BIDS rules into a schema that can be used directly by the validator.

Since this only affects BIDS-Derivatives (and that's clearly stated at a couple of places in this PR) I don't think this would break any existing software either, basically because there's no specification to break at this point.

Finally, I'll also mention that this model-<label>/ folder is basically specified like sub-<label> + participants.tsv file or the ses-<label>/ folder. So it doesn't seem to break any existing conventions for folder names.

I would propose to put models into separate data-type folder together with the sidecar json and summury tsv file:

I think these suggestions fit better in #1282. Can you check that out?

Thanks a lot for the input, happy to clarify further if the argument is not sufficiently convincing.

oesteban avatar Sep 14 '22 14:09 oesteban

Since this only affects BIDS-Derivatives (and that's clearly stated at a couple of places in this PR) I don't think this would break any existing software either, basically because there's no specification to break at this point.

Indeed, the derivatives in BIDS are not regulated. But even so, deviating from BIDS style of names can provide difficulties to use such derivatives in the analysis scripts. If you take and example of bids-matlab (free publicity!), it allow to parse files even if they are not in the official BIDS schema. But the dwi/model-DTI folder will be ignores as it doesn't starts with sub-.

Imagine you want a script that compare different models inside the same dataset. If these models folders are supported, then you just need to query all files with suffix model, and you have all of them.

In opposite, you need to do recursive search for the folders containing model-<name>, and then parse the subject/session names retrieve corresponding images. It's feasible, but requires additional code.

I'm speaking less from human and more from machine perspective. Bids software know how to deal with bids-formatted names and structure, so any deviations will add just more complexity.

I think these suggestions fit better in https://github.com/bids-standard/bids-specification/pull/1282. Can you check that out?

My point was about model-<name> sub-folder, not about the name of suffix. The example I give can be changed in suffixes, entities and extensions. But I fight for data-type folder and sub- prefix)

nbeliy avatar Sep 14 '22 14:09 nbeliy

But even so, deviating from BIDS style

How this proposal deviates from BIDS style specifically?

If you take and example of bids-matlab (free publicity!), it allow to parse files even if they are not in the official BIDS schema. But the dwi/model-DTI folder will be ignores as it doesn't starts with sub-.

BIDS should remain agnostic to the decisions made by the software developer. This idea comes from the fact that scientists use folders typically to cluster the outputs of important models in their analysis. They typically group by subject first, and that's why sub-<label> is at the top. It seemed logical to offer the opportunity to encode models this way. That said, you don't need to have a model directory if you don't want (the same way you may not have the session folder).

Imagine you want a script that compare different models inside the same dataset. If these models folders are supported, then you just need to query all files with suffix model, and you have all of them.

Human: goes to the _models.tsv file and sees all the models with relevant info. Software: e.g., pybids could do this tomorrow with a simple edit of the bids schema inside:

>>> layout.get_models(datatype="func")
[ list of all model identifiers under datatype func ]

oesteban avatar Sep 14 '22 15:09 oesteban

From BIDS/common principles

For a data file that was collected in a given session from a given subject, the filename MUST begin with the string sub-

Note that sub-

Folder model-<name> is collected from one subject (being placed in sub- folder) but don't contain sub- entity.

Software: e.g., pybids could do this tomorrow with a simple edit of the bids schema inside:

But if it works without any edit, isn't better? If all models for a subject are collected in datatype folder, with same naming conventions, they could be treated as normal data and retrieving with simple query:

layout.get_data(datatype="models", type="func")

I don't use pybids so I don't know how to query individual entities.

Anyway it's just a suggestion, to take or to leave.

nbeliy avatar Sep 14 '22 15:09 nbeliy

From BIDS/common principles

For a data file ...

This is not a data file, this is a folder. Folders do not apply the entity concatenation principle.

But if it works without any edit, isn't better?

Your proposal (or any derivative of it) will require a change in PyBIDS of exactly the same scale of changes as the one here (because you are proposing new suffices). From this perspective, there's no difference, IMHO.

I don't use pybids so I don't know how to query individual entities.

I'm not saying here we should cater to any specific project -- I don't use bids-matlab so for me the argument goes the other way around.

oesteban avatar Sep 14 '22 16:09 oesteban

Making "model" a datatype is an interesting suggestion. It might cause more ambiguity than it solves, but let me follow that train of thought, if I can maybe poke and prod at the concerns about BIDS "conformity" in the most general sense.

You can conceptualise the current filesystem structure as following this hierarchy:

  1. Single-entity-based directories (sub-<label>, ses-<label>);
  2. One non-entity directory (datatype);
  3. Files (must replicate entities from 1. and have a suffix).

What's being proposed can violate this structure in a couple of different ways:

  1. Having eg. model-<label>/ as a sub-directory of eg. dwi/ breaks the ordering between 1 and 2;
  2. Having directory names replicate entities from parent directories has no precedent.

Point 1 is what is resolved by having model as its own datatype. Point 2 is a consequence of us going for Option 3 in https://github.com/bids-standard/bids-bep016/issues/50.

Hoping that might help rather than hinder seeing where each side is coming from.

Lestropie avatar Sep 21 '22 09:09 Lestropie

hi @oesteban @Lestropie @arokem @PeerHerholz @poldrack @adelavega @effigies

I finally had the time to look at all these proposals organized by @Lestropie I am grateful for that.

Also, I looked over what @oesteban has proposed here, thank you @oesteban.

Furthermore, @PeerHerholz pointed out that the term model (and 'models') is already being used in the Stats Model Specification and by the Computational Models which impinge on each others.

I am starting from the principles and on the fact that we should not fix BIDS but develop a description for connectivity data: Less is more target easier and doable goals, finish less but get it done!

The 80/20 rule tackle solutions that address 80% of most commonly used cases

Identify & focus on

  • core principles not extreme/rare cases
  • main ideas not specific details
  • creating the backbone of the BEP, avoid focus on theoretical aspects

It seems to me that we are not going to find a solution that will not have problems. Each solution will either require changes to BDIS, or (many of them) break the inheritance principle, or overlap with other datatypes or the Stats Model Spec.

I would like to suggest that instead of trying to solve this problems, the 8 of us. We advance with a BEP. We pick one option that will have problems, that it will not be perfect and bring the option to the wider community.

I also suggest that we bring it with the hope that the broader community accepts it. This is because we can gain much from having a poor BIDS spec on this and little from continuing to find a perfect way.

I am aware that what I am writing means that we will most likely later need to make changes. Yet, I would love to use the 80/20 or even a 70/30 rule here and just move forward with a proposal that will be sent to the community for feedback.

Looking at all the proposals described by @Lestropie here and the option described by @oesteban here, we attempted to make another example for functional model here

The solution seems to be OK to us. Very similar to what a Stats Model Specification but reasonable, not perfect. This approach should also allow to rewrite the Dimensionality reduction network and not affect the Connectivity Matrix description which uses an approach similar to the original DWI BEP16 approach.

I am sure we can find cases in which this fails. But 7030 ok to y'all? Can we keep this one and send it to the community, please?

francopestilli avatar Sep 21 '22 23:09 francopestilli

@francopestilli : just making sure that I understand what you are suggesting: you want to incorporate the content of this PR into the BEP16 repo and close it here? Basically, kicking the can of the discussion on the different alternatives that @Lestropie detailed in the google doc to the discussion on the BEP? Is the idea that the other BEPs that depend on this decision wait to see how BEP16 shakes out?

arokem avatar Sep 22 '22 04:09 arokem

I would suggest we should not merge it with BEP 16, but I also don't think we can make this decision in haste, as it has potential implications.

Alternatively, this becomes its own BEP, and BEP16 and related efforts "depend" on the model BEP. That would give people more time to review this proposal without necessarily holding anything up. (I say this not having had the time to give this a thorough review yet)

adelavega avatar Sep 22 '22 04:09 adelavega

I would suggest we should not merge it with BEP 16, but I also don't think we can make this decision in haste, as it has potential implications.

+1 to every word here

That would give people more time to review this proposal without necessarily holding anything up.

I agree this is critical too, in both arguments (i) allow time and (ii) without holding up.

Furthermore, @PeerHerholz pointed out that the term model (and 'models') is already being used in the Stats Model Specification and by the Computational Models which impinge on each others.

I agree "model" is very heavy-loaded and likely to find opposition. Stealing from @effigies' comment, I think this PR would still be a good idea if we changed the word with node (as for a node in the computational graph). Alternatively, I've thought of step (which I don't love), set (I prefer node, but don't dislike it).

I know @effigies' nodes have a different meaning, but it seems that we were leaning towards level- for the semantics Chris wrote above. Level has the benefit of having been used in the BIDS-Apps paper/spec before. An alternative to level from the CS world could be stage (stage-run, stage-subject, stage-group) ... but that all would fall out of this PR.

Would you see node as a better term for this proposal? Something like:

└─ <pipeline_name>/
   └─ sub-<label>/
      └─ <datatype>/
         ├─ node-<label>/
         │  ├─ node_description.json 
         │  └─ <node files> 
         ├─ sub-<label>_nodes.tsv 
         └─ sub-<label>_nodes.json 

@arokem, I believe @francopestilli's message was more along the lines that "this PR is fine, and seems to have stuck in some other places, but let's be practical and make more concrete advances within BEP016 in particular (and other BEPs) in general". I believe @adelavega's comments would be consistent with that, but stated in more specific terms.

If node- felt less problematic, I can update the PR accordingly.

oesteban avatar Sep 22 '22 07:09 oesteban

+1 for node. I think model is a very loaded word with more conceptual implications, whereas node can be more generic.

adelavega avatar Sep 22 '22 14:09 adelavega

We also are using node in the sense of a node in a computational graph, and I am for using it that way here. (As an aside, I note that BEP 017 is using it to refer to nodes in a connectivity graph defined by a connectivity matrix...)

If you are speaking of computational graphs and not simply collections of models that you happen to want to co-run, it might be worth revisiting the hierarchy here, as this is exactly the situation that led us where we're going in FitLins. With the BIDS Stats Models graph, there is an I/O layer between nodes, and any node could be terminal and produce a derivative. Therefore, it ended up being cleanest for us to treat the output of each node as a fully-fledged derivative that then is the input to the next node.

Would it make sense in the DWI context to move node- up to the top level? And I suppose this is a question for the computational models BEP leads as well. I could imagine something like:

model_derivs/
  node-A/
    dataset_description.json  # `DatasetType: derivative`
  node-B/
    dataset_description.json  # `DatasetType: derivative`
  node-C/
    dataset_description.json  # `DatasetType: derivative`
  dataset_description.json  # We might want a new `DatasetType`
  graph.json  # Describes the computational graph

Each of the node directories could be distributed on its own as a derivative.

Getting ahead of myself, but we could also imagine a graph.json that shows the connection of other derivatives, such as:
derivatives/
  fmriprep/
    sourcedata/
      raw-bids-dataset/
      freesurfer/
  fitlins/
    node-run/
    node-subject/
    node-group/
    graph.json
{
  "Nodes": {
    "raw": "../fmriprep/sourcedata/raw-bids-dataset",
    "preproc": "../fmriprep",
    "run": "node-run",
    "subject": "node-subject",
    "group": "node-group",
  },
  "Edges": [
    {
      "Destination": "run",
      "Sources": ["raw", "preproc"],
    },
    {
      "Destination": "subject",
      "Sources": ["run"],
    },
    {
      "Destination": "group",
      "Sources": ["subject"],
    }
  ]
}

effigies avatar Sep 22 '22 16:09 effigies

@francopestilli : just making sure that I understand what you are suggesting: you want to incorporate the content of this PR into the BEP16 repo and close it here? Basically, kicking the can of the discussion on the different alternatives that @Lestropie detailed in the google doc to the discussion on the BEP? Is the idea that the other BEPs that depend on this decision wait to see how BEP16 shakes out?

Hi @arokem actually I went over each one if the options @Lestropie proposed in the Google doc. I added also examples at the bottom of the doc of how this proposal would save func data.

I was not able to find a solution in any of the options @Lestropie reviewed and presented. Did i get that wrong? Was there any option in the doc that was a potential solution / a way forward?

Just to clarify i did follow everything rob did

francopestilli avatar Sep 22 '22 16:09 francopestilli

We also are using node in the sense of a node in a computational graph, and I am for using it that way here. (As an aside, I note that BEP 017 is using it to refer to nodes in a connectivity graph defined by a connectivity matrix...)

I think we are talking about different nodes in a computational graph. I'm fairly confident that your node-run, or node-subject have internally more nodes of a computational subgraph inside. Here, node is used as an atomic entity that doesn't have a computational sub-graph underneath (obviously, this is mostly theoretical and arbitrary because these models will certainly be decomposable in smaller pieces).

Regarding the overlap with the connectivity BEP - I have less arguments to maintain node. Perhaps, we could go back to the root of the PR and, instead of enabling these folders to encode computational nodes (or a particular model's outcomes), think of these as a utility for the user to create "results sets" (which is the common practice of researchers this BEP is trying to enable):

└─ <pipeline_name>/
   └─ sub-<label>/
      └─ <datatype>/
         ├─ set-<label>/
         │  ├─ resultset_description.json 
         │  └─ <resultset files> 
         ├─ sub-<label>_sets.tsv 
         └─ sub-<label>_sets.json 

As you can see, I'm shying away from putting this higher in the hierarchy because that will open another can of worms. A less voluminous problem we intendedly avoided was multi-datatype analyses. It seems reasonable to propose allowing set-<label> folders as siblings of <datatype>/, having a new multi/ datatype folder where these resultsets can be stored or allowing composition <datatype1>+<datatype2>/. However, that is left for future PRs and debate or otherwise this one might not see a decision ever. For the same reasons, I would avoid stating in this particular PR that the folders can be used at any level of the tree structure.

oesteban avatar Sep 22 '22 16:09 oesteban

@arokem

Re "the other BEPs that depend on this" it looks like we have had several of the community discussing this issue. Would it be fair to hope that this can be accepted quickly?

-- Yes, we we would have BEPs waiting for this and one BEP (the dimensionality reduction one) has already used the 'model-' specification (it would need to be updated to the final version though).

francopestilli avatar Sep 22 '22 16:09 francopestilli

BEPs waiting for this and one BEP (the dimensionality reduction one) has already used the 'model-' specification (it would need to be updated to the final version though).

I would encourage everyone to try not to need this feature (in other BEPs) -- I see it more as a resource to the user than as a key to unlock the progress at the complicated points of other BEPs. Otherwise, we risk rushing important changes that we will not be able to roll back (backward compatibility) and I agree with @adelavega that we do want the input of as many as possible before proceeding with this idea.

oesteban avatar Sep 22 '22 17:09 oesteban

I like "result" instead of "model" ("resultset" also fine!), to avoid the already crowded space around that word. I think that "node" might confuse both because of the overloading of the term with analysis of connectivity, and because I suspect that there are many that don't think of the results they obtain in terms of nodes in a computational graph, even if that's a technically accurate description (I don't usually, unless someone/something forces me to).

@francopestilli : I am still not sure what you suggest. That we move on with the proposal in this PR? The example you added at the bottom of the google doc is not based on this PR, but based on a different variety of this idea ("alternative 3"), which does not introduce the new "results_description.json".

IIUC, one difference between "alternative 3" and the current PR is whether result/resultset metadata are stored in multiple files that are all called "resultset_description.json" (or similar) or whether we require that this be saved in files that have different names. Though formally I think that this is a minor difference, and both seem acceptable to me, I think that there might be some practical benefits to varying names across results/resultsets in minimizing confusion.

So, with minimal changes from the proposal in https://github.com/bids-standard/bids-specification/pull/1280#issuecomment-1255269988 + some additional variations (adding a suffix, pulling it up one directory, to avoid the inheritance conflicts):

└─ <pipeline_name>/
   └─ sub-<label>/
      └─ <datatype>/
         ├─ set-<label>_<suffix>/
         │  └─ <resultset files> 
         ├─ set-<label>_<suffix>.json 
         ├─ sub-<label>_sets.tsv 
         └─ sub-<label>_sets.json 

Could potentially go even further:

└─ <pipeline_name>/
   └─ sub-<label>/
      └─ <datatype>/
         ├─ sub-<label>_set-<label>_<suffix>/
         │  └─ <resultset files> 
         ├─ sub-<label>_set-<label>_<suffix>.json 
         ├─ sub-<label>_sets.tsv 
         └─ sub-<label>_sets.json 

Where the sub-<label> entity could then further down the line be replaced by a group-<label> entity if that makes sense, but that's probably for a future PR.

EDIT: Edited to remove one option that I had here previously, because it may run into inheritance issues (again...).

arokem avatar Sep 22 '22 19:09 arokem