bids-specification icon indicating copy to clipboard operation
bids-specification copied to clipboard

[ENH] Allow for - (dash) (and/or +) in <label>

Open yarikoptic opened this issue 5 years ago • 34 comments

I thought that we already had an issue filed for this, but may be it was just my comments in the google doc version of the BIDS specification, so filing anew. Please close it referencing original one if one already exists.

ATM <label> is as I formalized in yet to be finished PR is allowed to contain only alpha numerics. That makes it really difficult in some cases to place multiple entries (e.g. details for _acq- or abbreviate details of preprocessing in _desc- of BIDS derivatives). In ReproIn heuristic for heudiconv, where we do allow for - we just replace them with an ugly X as a separator to just stay compatible with BIDS.

I have been suggesting to extend the list of allowed characters in <label> to include at least - and possibly others (e.g., +) which otherwise should not introduce any backward incompatibility with BIDS 1.0 -- all BIDS 1.0 names would be compatible with new syntax allowing - in the <labels>.

The original concern (probably at least a year back) which precluded any action was a possible breakage of parsing BIDS filenames for BIDS-supporting software, and was suggested to be included in coming some time in the future BIDS 2.0.

N.B. I have failed to file and issue outlining plans for BIDS 2.0, so I initiated a new milestone (2.0.0) to which I am adding this new issue.

IMHO (and may be we should outline it somewhere) BIDS supporting applications should be coded adhering to Postel's principle which is driving TCP implementations to "be conservative in what you do, be liberal in what you accept from others.". And most likely, in the case of added -, they already are since they are more likely to be splitting based on _ and then on - to separate key from value instead of full rigid regular expressions limiting to the allowed characters (not so obviously) mandated by the BIDS specification. So the only catch here indeed might be splitting on - without restricting that only a single split should happen. That is where breakage could happen if - is added. The possibility of breakage in case of extending the list of allowed characters with + (or even : which I dislike to suggest due to messing up scp/ssh) unlikely to have negative ramifications.

Please share your thoughts/opinions

yarikoptic avatar May 09 '19 21:05 yarikoptic

I'm personally okay with - or +, and can't really remember the arguments against them.

effigies avatar May 09 '19 21:05 effigies

I'm okay with +, and not very fond of - but I understand it could be added. I would not allow - to be present in the suffix entity, though.

oesteban avatar May 09 '19 23:05 oesteban

I would not allow - to be present in the suffix entity, though.

Great point I haven't thought to exercise @oesteban ! -- I would not allow it as well, makes parsing much more difficult etc! The only spot I see where <label> is used not within _key-<label> is

(git)hopa:~/proj/bids/bids-specification[derivatives]git
$> git grep '_<label'
src/04-modality-specific-files/04-intracranial-electroencephalography.md:called `electrical_stimulation_<label>`, with labels chosen by the researcher and

so might need analysis/tune up (or not)

yarikoptic avatar May 10 '19 00:05 yarikoptic

ok, after https://github.com/bids-standard/bids-specification/pull/152 is merged in one way or another we could have a PR extending the list of allowed characters non-ambiguously in a single location ;-)

yarikoptic avatar May 10 '19 01:05 yarikoptic

Note that the request to include hyphens in key-value labels is currently in the "Suggestions for BIDS 2.0" Google doc:

Allowing hyphens in filename key values

which suggests, as I had recalled, that this was a settled issue. Including hyphens increases the complexity of parsing the filenames, was the reason I recall.

nicholst avatar May 10 '19 12:05 nicholst

@nicholst, will '-' really increase complexity of parsing?

Each key-value parsed by '_' and everything after the first '-' is a label.

nbeliy avatar May 10 '19 13:05 nbeliy

@nicholst Oh, that is where BIDS 2.0 "extensions" are discussed -- thank you for the link! I have added the link to the 2.0.0 milestone description. I think it might be worthwhile to move them all under issues over here at some point, so elderly folks like us could easily find everything relevant in the single location (here).

I would not say that parsing complexity would increase, but indeed it might require some adjustment.

yarikoptic avatar May 10 '19 14:05 yarikoptic

@nbeliy,

@nicholst, will '-' really increase complexity of parsing?

Each key-value parsed by '_' and everything after the first '-' is a label.

Just reciting the conclusion of a long thread about the issue. Like anything, this issue can be re-raised, but I think there is the basic issue of extra clarity from having the hyphen play exactly one role in file/directory names.

nicholst avatar May 11 '19 16:05 nicholst

my 1c: This issue by now is probably a few years old (IIRC I have suggested it possibly even during BIDS 1.x initial development/release) . If we acted sooner, it would be no issue but even those years back I have proposed it the main argument against was "we would break the tools" of which there were much fewer. So longer we wait to act, the harder would be to adopt this change.

yarikoptic avatar Aug 14 '19 14:08 yarikoptic

I think at this point the procedure would be sending a draft PR with the suggested changes for a more focused discussion, is that correct @franklin-feingold @sappelhoff ?

oesteban avatar Aug 14 '19 14:08 oesteban

I think at this point the procedure would be sending a draft PR with the suggested changes for a more focused discussion, is that correct @franklin-feingold @sappelhoff ?

Yes, and on top of that I think what this discussion needs are:

  1. concrete use cases where the proposed changes would significantly enhance BIDS (summaries/links to such cases)
  2. a concrete PR showing the diff of the spec prior and post the proposed changes (what @oesteban said)
  3. more participants to voice their opinions - especially developers

Currently this change seems to me as a minor improvement with the potential for a major confusion ... but I am very happy to be convinced otherwise :-)

sappelhoff avatar Aug 14 '19 14:08 sappelhoff

Use cases:

Subject or session labels come from long-established project that has identifiers with dashes.

Related, such a project uses identifiers with underscores (but not dashes), and the underscores need to be replaced; having dashes as an option for replacing is helpful.

nicholst avatar Aug 16 '19 14:08 nicholst

Is UTF8 allowed in BIDS file names, or if not, should it be? If so, there are many dashes to choose from which are not a minus (ASCII decimal 45). For example ‒ – and —. See https://www.w3schools.com/charsets/ref_utf_punctuation.asp

robertoostenveld avatar Nov 01 '19 16:11 robertoostenveld

Using alternatives that look like the proposed addition doesn't help much with readability, makes it harder. Imagine a suffix like _inplane‒T2.json (I used your first UTF8 character). _inplane+T2.json breaks the suffix a bit, but is still readable. The limitation of the suffix is also important in this kind of proposal, IMHO.

I would be fine with the + symbol and I think the dash is not worth the pain. Adding the dash would feel to me as useful as allowing spaces (i.e., error-prone and likely confusing).

oesteban avatar Nov 01 '19 22:11 oesteban

I think we should explicitly forbid UTF8 and limit to ASCII to avoid such "tricks" which could potentially cause confusion. So let's move forward with "+"?
I doubt any tool was so strict with regexp that it listed only alpha numerics. and sooner we change, less possible negative effect it would have. This issue already dragged long time (I originally suggested in the times of BIDS being a google doc), and I think it is worth to be accepted to 1.x series without awaiting 2.0 which I am not quite sure would come in any foreseeable future

yarikoptic avatar Nov 02 '19 00:11 yarikoptic

Agreed about limiting character sets, I also don’t look forward to multiple types of dashes or emojis in filenames. But the limit should be explicitly mentioned.

Regarding the “+” I am neutral.

Verstuurd vanaf mijn iPhone

Op 2 nov. 2019 om 00:57 heeft Yaroslav Halchenko [email protected] het volgende geschreven:

 I think we should explicitly forbid UTF8 and limit to ASCII to avoid such "tricks" which could potentially cause confusion. So let's move forward with "+"? I doubt any tool was so strict with regexp that it listed only alpha numerics. and sooner we change, less possible negative effect it would have. This issue already dragged long time (I originally suggested in the times of BIDS being a google doc), and I think it is worth to be accepted to 1.x series without awaiting 2.0 which I am not quite sure would come in any foreseeable future

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

robertoostenveld avatar Nov 02 '19 06:11 robertoostenveld

ok, it seems that the path of the least resistance would be to add + as allowed character. I will look into proposing a PR here and against bids-validator for that (unless someone would beat me to it which I would really appreciate)

yarikoptic avatar Dec 23 '19 17:12 yarikoptic

oh hoh, it had been a year! Few final remarks and then I will try to find time to compose a PR.

  • + implies composition of multiple things, like _proc-mc+filt -- thing was motion corrected and filtered; whenever - is more of a pure delimiter between multiple words which need to be taken together
  • re _inplane‒T2. example -- "illegit" since - must not be allowed in the suffix, which it must not look like _key-value. This issue only about <label> as used in the _key-<label>, and we never use _<label> as arbitrary suffix (git grep -e '<label>\.' -e '_<label>' comes with no relevant hits)
  • comparison of - to spaces is IMHO also a bit too far stretched. - does not break visual composition of the filename etc.
  • arguments against - because there are UTF8 characters for longer dashes are not really pertinent. Pretty much any character could have some confusingly similar symbol in UTF8, so we just must not allow utf8 in general.

Since I and others mentioned a number of use cases, and IMHO it would be easier to strip away than to add later, I will prepare a PR which would introduce allowing both - and + and then we will see how it goes.

yarikoptic avatar May 08 '20 15:05 yarikoptic

woohoo -- I found you my darling issue -- you have been moved and I thought that I just lost my "search foo" skill!

With the addition of BIDS URIs (https://github.com/bids-standard/bids-specification/pull/918) into BIDS specification (aiming for 1.8.0 release) I really want to make a case for this issue to be given yet another consideration:

  • BIDS URIs PR introduced a "breaking" new feature, while deprecating support for prior paths-based IntendedFor and requiring everyone to adjust to use bids:: in those cases. Operation of any tool now which uses IntendedFor and which was created for BIDS prior future 1.8.0 release would break now on a first BIDS 1.8.0 compliant dataset. As to me, that PR establishes a precedence to appeal for this issue/request since we are running into such use cases with quite some regularity and for which we have no good workaround besides "use X instead" or "come up with a shorter oneWord one" (hard it label is just numbers).
  • this is one of the oldest feature requests. If it was filed OVER 3 years ago. If it was timely handled back then, we would not even be speaking over this issue ever again.
  • It is good to have this repository with possible items toward BIDS v2 , but AFAIK there is no clear roadmap/timeline toward v2, so I do not see this issue being addressed in immediate future.
  • per discussion/analyses above allowing for [-+] in the <label> would not even guaranteed to break any tool right away: tool might be just as fine, users would need to come to make use of this "new" functionality (unlike with BIDS URIs - they have to switch to use them, so causing tools to break "earlier")
  • unlike BIDS URIs, adding support for having [-+] in the label should be trivial for any tool/language. And hopefully soon, as more tools use our BIDS schema, fix for this schema would be "automagically adopted".

Now I will really make an effort to file a PR but might need to prep few "preparatory" PRs first.

yarikoptic avatar Jul 25 '22 16:07 yarikoptic

I moved this back. Sorry, not sure exactly why it got moved; perhaps it seemed abandoned, or was lumped in with the BIDS-2 proposal linked above because of the - suggestion.

Reading through the thread, I think there's basically:

  • Rough support (as in some support but no opposition) for +
  • Opposition to - for having multiple meanings
  • Opposition to UTF-8 for deceptive characters

So I think a reasonable proposal is to change the regex for label to [a-zA-Z0-9+], and associated textual changes. It does not seem worth the time to write for both + and -, given that there's known opposition and - was specifically tabled before.

effigies avatar Jul 26 '22 17:07 effigies

While I have no objection to + other than that it looks ugly, - I think is a very bad idea. I think the main issue is not high-level tools such as validators, but basic tools such as user bash scripts and path legibility at a glance. Minus is a special character, and that is good. I would make an analogy to periods in file names, using periods to delimit words in a filename is a bad idea.

The genealogy of this request is also indicative of problems which BIDS could help solve by encouraging better data practices. This --desire generally crops up because operators will put the experiment name, animal name, animal genotype, session date, etc. in the name field to keep it all in one place. That is a poorly annotated experiment and not a string which is just “too fancy” for BIDS. So I think there is something to be said for explicitly not supporting minuses. If you need a minus you're trying to cram 2 concepts in a field, pretty much by definition.

TheChymera avatar Jul 26 '22 17:07 TheChymera

... using periods to delimit words in a filename is a bad idea

and thus we don't use periods in the middle of the filenames (although could), but we do use .tar.gz and others to compose extensions together. In other words, "allowed != always a good idea". But on contrary "disallowed == need to come up with uglier workaround" (.tgz?).

Although I agree that quite often desire to place - within a label as indicator of "underthinking" about separating metadata, it is not always the case and some cases do warrant "multi-word" label, e.g. within _acq- field to indicate e.g. different SMS and GRAPPA factors etc for which there is no (and never would be due to too small of a niche) dedicated entity. Situation becomes even more fun for _desc- in derivative data to provide some meaningful description.

Another analogy -- BIDS doesn't restrict the length of the <label> but nobody does super long labels. The same sanity I expect to be applied to use of - by the user.

yarikoptic avatar Jul 26 '22 19:07 yarikoptic

Although I agree that quite often desire to place - within a label as indicator of "underthinking" about separating metadata, it is not always the case and some cases do warrant "multi-word" label, e.g. within _acq- field to indicate e.g. different SMS and GRAPPA factors etc for which there is no (and never would be due to too small of a niche) dedicated entity. Situation becomes even more fun for _desc- in derivative data to provide some meaningful description.

Are we not talking about the 20% in the 80/20 principle? How many datasets would need such flexibility to specify acq-? This means, all other entities are the same so you can't disambiguate ?

I tend to agree with Chris' assessment, and, as I mentioned before, once + is allowed, it could be given tokenization semantics. That would turn - redundant as its sole proposed purpose is label tokenization.

oesteban avatar Jul 26 '22 20:07 oesteban

I agree with the usefulness of +, is there any update on this issue ?

neurorepro avatar May 16 '23 12:05 neurorepro

No update. Rereading this, I think my assessment remains the same as before: We can do this. Somebody needs to make a concrete proposal in the form of a PR. I have some new questions that I would like to see addressed in a proposal (possibly they came up and I skimmed too briefly):

  1. Can all labels include +, or should we have labels ([a-zA-Z0-9]+) and multi-labels ([a-zA-Z0-9+]+)?
  2. Is + allowed in index ([0-9]+) or just label?
  3. Can + be the first or last character of a label(/index)? (If no, the regexes will get much uglier.)
  4. Is this generally available or only in derivatives?
  5. Do we define any semantics? e.g., if you combine run-01 and run-02, is run-01+02 now acceptable? Previously derivatives said to remove the run entity. In the other direction, does run-01+02 imply any relationship to run-01 or run-02 per the standard? This will have implications on tooling: if I query layout.get(run=1), should I get run-01 and run-01+02?
  6. What does this mean for suffixes? These aren't "entities", so we could say this is just labels.

I would suggest the following answers, but I'm not 100% set:

  1. Split labels and multi-labels. Subject shouldn't have a multi-label IMO. Possibly others shouldn't as well.
  2. Just label. Indexes have numerical meaning and a list of numbers is less clear how to work with.
  3. Weak no. I could imagine a label like B1+ might be useful enough to justify permitting it.
  4. General.
  5. No semantics. It's just another character in a label for human readability. Tooling should not count on consumers to interpret multi-labels as compositions of discrete meanings. If detailed metadata needs to be communicated, it should be in JSON.
  6. Leave suffixes as a controlled vocabulary. On a case-by-case basis a suffix with a + can be considered.

effigies avatar May 16 '23 13:05 effigies

Thanks @effigies , I'm happy to help on the PR

neurorepro avatar May 16 '23 13:05 neurorepro

Are you @effigies to introduce some new semantic behind milti-label - Ie would have some meaning beyond just being a label?

yarikoptic avatar May 16 '23 14:05 yarikoptic

Hmm. Not intentionally (see 5). Possibly that argues against my position 1. I do think the end result should be logically consistent, however these questions are answered.

effigies avatar May 16 '23 14:05 effigies

The same sanity I expect to be applied to use of - by the user.

Ambitious hope. I think the lack of minus support is usually the point where people are introduced to metadata separation. I'm still explaining to one of our colleagues why the name of his mouse is not part of the session name.

I continue to think this makes BIDS less conducive to good metadata practices. If you really really want to allow something like this, maybe it should be made sufficiently prohibitive so as to give the user pause as to whether it's worth it. We could support some other character. Perhaps the cdot, , it's actually very aesthetic and distinctive from the minus (and very obviously not a period either). This is similar to an older suggestion by @robertoostenveld , except the cdot is visually distinct enough from the dash/minus family (em-dash , en-dash , and minus -) to not cause visual confusion.

The prohibitiveness of comes from most people not knowing where it is on their keyboards.

TheChymera avatar May 22 '23 14:05 TheChymera

@effigies could you confirm that here https://github.com/bids-standard/bids-specification/issues/1165#issuecomment-1549699835 you are referring to only + ? I believe one way of pushing this forward faster is dropping - from proposals now, see + added and if someone is really interested in also permitting - then iterate on a new PR.

Another weak idea regarding 3 (and trying to get regexes even uglier) is: should two + be allowed? Something like: acq-O++O.

Overall, I totally agree with @effigies' assessment, but also agree with @yarikoptic on why this should get added semantics (e.g., allow for subject labels).

oesteban avatar May 23 '23 07:05 oesteban