Validator MUST NOT accept identical files under different extensions

Open arnodelorme opened this issue 5 years ago • 6 comments

This BIDS dataset contains both an .edf and a .bdf file for the same recording (both files are very small):

https://openneuro.org/datasets/ds002034/versions/1.0.1

sub-01/ses-01/eeg/sub-01_ses-01_task-offline_run-01_eeg.edf
sub-01/ses-01/eeg/sub-01_ses-01_task-offline_run-01_eeg.bdf

I believe it should not have passed the validator since there are 2 types of binary files and the BDF file is obviously corrupted.

arnodelorme avatar Nov 05 '20 05:11 arnodelorme

Thanks for the report @arnodelorme, it seems like you're going through a lot of datasets these days :-)

I agree that the validator should catch these cases. A given EEG file such as sub-01/ses-01/eeg/sub-01_ses-01_task-offline_run-01_eeg.<ext> MUST NOT be present more than once through using different extensions <ext>.

sappelhoff avatar Nov 05 '20 08:11 sappelhoff

I haven't checked whether the BDF file is corrupted, but if it truly is, that raises another, already known, concern: We are not validating the contents of binary EEG files.

This problem is hard to solve, because we would need to implement data format readers in JavaScript so that the bids-validator can open the files and check their validity. Currently, this is done for NIfTI files only.

Many months ago I tried to implement a reader/validator for the BrainVision format in JavaScript here: https://github.com/sappelhoff/brainvision-validator/ ... see also bids-standard/legacy-validator#475

However, I ran into problems integrating it with the bids-validator, because the validator runs both in the browser and on the CLI, and the "file access" API in the browser is significantly different from, and more complicated than, accessing files from the CLI (or from programs written in Matlab or Python).

I will open a separate issue for this, and we should certainly address it as soon as we have some resources available (by resources, I mean people who have expertise, energy, and time).
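As a rough illustration of how small a first content check could be for the EDF/BDF case reported above, here is a Python sketch (not validator code) that compares a file's extension with the 8-byte identification field at the start of its header. EDF files start with an ASCII "0" followed by seven spaces, and BDF files start with byte 255 followed by "BIOSEMI". The function names `sniff_format` and `extension_matches_header` are hypothetical, and the check says nothing about the rest of the header or the data records.

```python
# Identification fields at the start of the fixed header.
EDF_SIGNATURE = b"0       "      # ASCII "0" followed by seven spaces
BDF_SIGNATURE = b"\xffBIOSEMI"   # byte 255 followed by ASCII "BIOSEMI"


def sniff_format(header: bytes):
    """Return ".edf" or ".bdf" based on the first 8 bytes, or None if unknown."""
    first8 = header[:8]
    if first8 == EDF_SIGNATURE:
        return ".edf"
    if first8 == BDF_SIGNATURE:
        return ".bdf"
    return None


def extension_matches_header(filename: str, header: bytes) -> bool:
    """True only if the header is recognized and agrees with the file extension."""
    detected = sniff_format(header)
    return detected is not None and filename.lower().endswith(detected)
```

To use it on a real file, the first 8 bytes could be read with `open(path, "rb").read(8)` and passed in as `header`. A file that fails this check is at least mislabeled; a file that passes may still be corrupted further in.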

sappelhoff avatar Nov 05 '20 08:11 sappelhoff

In this issue, let's track our progress to prevent users from storing the same data under different extensions.

This should be some rule that:

  • IF a file sub-01/ses-01/eeg/sub-01_ses-01_task-offline_run-01_eeg.<ext> is present
  • AND is from the list LIST_OF_ACCEPTED_DATA_FORMAT_EXTENSIONS
  • then there MUST NOT be any other file with the same name and an ext from that list

This sounds difficult, but possible to implement.
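To make the rule above concrete, here is a minimal Python sketch (not the validator's actual implementation). It groups files by their path without the extension and reports every name that appears with more than one accepted data format extension. The contents of `ACCEPTED_DATA_FORMAT_EXTENSIONS` are an assumption for illustration: one "primary" extension per format, so that multi-file formats are not flagged against themselves. The function name `find_duplicate_recordings` is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Assumption: illustrative subset with one primary extension per format,
# not the official list from the specification.
ACCEPTED_DATA_FORMAT_EXTENSIONS = {".edf", ".bdf", ".vhdr", ".set"}


def find_duplicate_recordings(paths):
    """Return {path without extension: sorted extensions} for every
    recording stored under more than one accepted extension."""
    extensions_by_stem = defaultdict(set)
    for path in paths:
        p = PurePosixPath(path)
        if p.suffix in ACCEPTED_DATA_FORMAT_EXTENSIONS:
            extensions_by_stem[str(p.with_suffix(""))].add(p.suffix)
    return {
        stem: sorted(exts)
        for stem, exts in extensions_by_stem.items()
        if len(exts) > 1
    }
```

Called with the two paths from the report, this would return a single entry for `sub-01/ses-01/eeg/sub-01_ses-01_task-offline_run-01_eeg` listing both `.bdf` and `.edf`. Sidecar files such as `.json` are ignored because their extension is not in the list.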

sappelhoff avatar Nov 05 '20 08:11 sappelhoff

Yes, this sounds like a good rule.

arnodelorme avatar Nov 05 '20 12:11 arnodelorme

There are two checks in the schema along similar lines for readme files and gzipped files: https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/checks/general.yaml

To generalize this rule, we would need to figure out some way of encoding mutually exclusive extensions into the file name rules. Taking EEG as an example: https://github.com/bids-standard/bids-specification/blob/master/src/schema/rules/files/raw/eeg.yaml

The sets of extensions that can coexist with each other (but not the other sets present) are: [ [".edf"], [".vhdr", ".vmrk", ".eeg"], [".set", ".fdt"], [".bdf"] ]
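As a sketch of how such sets could be checked once they are encoded somewhere, the Python snippet below flags any file name whose extensions come from more than one set. The sets are copied from the list above; the names `EXTENSION_GROUPS` and `find_conflicting_stems` are hypothetical, and how the schema would actually express the groups is still the open question.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Extension sets quoted from the comment above; extensions inside one set
# may coexist for the same file name, extensions from different sets may not.
EXTENSION_GROUPS = [
    {".edf"},
    {".vhdr", ".vmrk", ".eeg"},
    {".set", ".fdt"},
    {".bdf"},
]


def find_conflicting_stems(paths):
    """Return {path without extension: sorted extensions} for every file
    name that mixes extensions from more than one group."""
    groups_by_stem = defaultdict(set)
    extensions_by_stem = defaultdict(set)
    for path in paths:
        p = PurePosixPath(path)
        for index, group in enumerate(EXTENSION_GROUPS):
            if p.suffix in group:
                stem = str(p.with_suffix(""))
                groups_by_stem[stem].add(index)
                extensions_by_stem[stem].add(p.suffix)
    return {
        stem: sorted(extensions_by_stem[stem])
        for stem, groups in groups_by_stem.items()
        if len(groups) > 1
    }
```

With this grouping, a complete BrainVision triplet (`.vhdr`, `.vmrk`, `.eeg`) for one recording is accepted, while the `.edf` plus `.bdf` pair from the original report is flagged.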

@effigies I feel like we've talked about this before, but can't remember any conclusions.

I'm kicking this over to the specification to be figured out there. Once a decision is made, the validator can implement it.

rwblair avatar Sep 16 '25 18:09 rwblair

  • #1492

effigies avatar Sep 16 '25 18:09 effigies