bids-validator
bids-validator copied to clipboard
Need exception for "Empty File error" for CTF datasets
The validator is now raising an error when empty files are encountered.
For the CTF system (MEG), it is customary to have a bunch of files in a directory ending with .ds
. Two of these files are BadChannels
and bad.Segments
and they are always written - even if no bad channels or bad segments are present. Thus, perfectly fine CTF data can contain empty files.
How should we deal with this? A special exception for empty files for BadChannels
and bad.Segments
files if they are found within a .ds
MEG directory?
Sounds reasonable to me.
Best, Chris
On Wed, Nov 14, 2018 at 12:33 PM Stefan Appelhoff [email protected] wrote:
The validator is now raising an error when empty files are encountered.
For the CTF system (MEG), it is customary to have a bunch of files in a directory ending with .ds. Two of these files are BadChannels and bad.Segments and they are always written - even if no bad channels or bad segments are present. Thus, perfectly fine CTF data can contain empty files.
How should we deal with this? A special exception for empty files for BadChannels and bad.Segments files if they are found within a .ds MEG directory?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/bids-standard/bids-validator/issues/651, or mute the thread https://github.com/notifications/unsubscribe-auth/AAOkp9TGTPOYsH-FejCe0wzEfyPVOiZ0ks5uvH42gaJpZM4Yeh1F .
actually, how do we deal with all the empty files in https://github.com/bids-standard/bids-examples ?
These will all fail the validator as soon as we update the version in Travis: https://github.com/bids-standard/bids-examples/blob/d9ff5e0c021453002efb40ff4b128e563e88fcd5/.travis.yml#L1-L6
@chrisfilo
Perhaps as a fix for the examples in bids-standard/bids-examples, we can tell travis to always validate using:
bids-validator -c {"ignore": [99]}
to ignore the EMPTY_FILE error. However, we'd still need a list for CTF files BadChannels
and bad.Segments
Sounds reasonable for this particular context.
The empty file error for the examples has been fixed this way: https://github.com/bids-standard/bids-examples/pull/137
(should have opened a separate issue though ...)
Main issue of here (CTF data) still persists.
has there been any progress on this front @sappelhoff @chrisfilo ? I need this to progress on a BIDS reader for MNE
I am unsure, which lines to change. Potential candidates are:
https://github.com/bids-standard/bids-validator/blob/4fd11fbd7c562ea7e390a2c7c73e429d506681a0/utils/files/testFile.js#L43
I suggest adding logic along the lines (pseudocode):
if (stats.size === 0) and NOT (file.name.contains ('BadChannels' or 'bad.Segments')) ... raise empty file error
However, there is also:
https://github.com/bids-standard/bids-validator/blob/e5b3816f0ada8495c46f249b9f203f401ff51ec5/utils/files/validateMisc.js#L3-L6
I find this second code part quite hard to understand ... potentially we have to add something for both code parts.
okay I see. I will try to see if I can use the workaround:
bids-validator -c {"ignore": [99]}
for now
That workaround was a proposal that I made, it actually does not work right now ...
the -c
argument should be a path to a JSON file.
See: https://github.com/bids-standard/bids-validator/blob/e5b3816f0ada8495c46f249b9f203f401ff51ec5/utils/options.js#L18-L71
It should be easy to add some lines to check if the string can be converted to a valid JavaScript Object ... if yes: take that as a config ... else, try to use it as a file name ... if that doesn't work either: error
The more I look at it, I am getting the impression that the code is actually already intended to support JSON strings such as bids-validator -c {"ignore": [99]}
instead of only file paths such as bids-validator -c bidsconfig.json
Specifically, I am wondering, whether these lines were intended to capture JSON strings:
https://github.com/bids-standard/bids-validator/blob/e5b3816f0ada8495c46f249b9f203f401ff51ec5/utils/options.js#L42-L44
but it seems to me that these lines are never reached, because all command line inputs qualify as strings, ... or is this only reached if no config is specified, and the input is thus the default {}
as defined here:
???
Some guidance would be appreciated @chrisfilo @nellh
would this be a valid approach? Note that I am adding the first if clause
loadConfig: function(config, callback) {
if (config.startsWith('{') and config.endsWith('}')) {
callback(null, { path: 'config' }, JSON.stringify(config))
}
else if (typeof config === 'string') {
// load file
files
.readFile({ path: config })
.then(contents => {
callback(null, { path: config }, contents)
})
.catch(issue => {
callback([issue], { path: config }, null)
})
} else if (typeof config === 'object') {
callback(null, { path: 'config' }, JSON.stringify(config))
}
},
@sappelhoff Indeed it should work if you specify it like --config.ignore=99 --config.ignore=44
. Note you have to include at least two, since it expects an array for ignore configuration.
Thanks @nellh! I will make a PR to make this obvious in the documentation as well.
Hello! The .eeg file inside CTF .ds folders can also be empty. Could it also be added as an exception?
Hi @Moo-Marc - thanks for reporting this.
Regarding "valid empty CTF files", would the list be complete after we include the .eeg
file? Or are there some other potentially unknown candidates that we can include in one go?
I've also seen some backup files that are created by CTF software that can be empty: e.g. bad.segments.bak, or *.eeg.bak. I believe that's it for potentially empty files.
Note that I asked CTF which files are considered "required" as per their dataset standard, and none of those empty files are. So they could in principle be deleted. But they are still valid. The required files are: .meg4, .res4, .infods, .newds, .hc
It seems to me that you should use .bidsignore
in this case
Do you mean just for the .bak files? Note that they show up quite often, even for data that was not really processed in any way. Just renaming a dataset or adding eeg channel locations, or visualizing the raw data with the CTF viewer can generate these files. I even think they might sometimes be created by the acquisition software. So I consider them fully valid "raw" dataset files.
By the way, I searched the BIDS spec for the word "empty" and found nowhere that empty files were not allowed. May I ask where this is specified?
By the way, I searched the BIDS spec for the word "empty" and found nowhere that empty files were not allowed. May I ask where this is specified?
As far as I know, this is not explicitly documented in our BIDS spec. I suspect that the EMPTYFILE check was introduced to catch erroneously empty files.
Sounds quite nightmarish with all these files that can be valid and empty in CTF. We could of course simply extend the list here:
https://github.com/bids-standard/bids-validator/blob/b2f0dd3e6a37eb6caa2df0e0b16f535942636536/bids-validator/utils/files/validateMisc.js#L6-L8
We would add eeg
and all .bak
versions of these files
What are your opinions on this?
I'd suggest adding *.eeg and *.bak to the exception as an immediate step. But maybe there needs to be a broader discussion about the check itself.
I'd suggest adding *.eeg and *.bak to the exception as an immediate step
I think that adding a wildcard * would not be a good idea. We should really be able to pin down the exact file names. Otherwise we could run into problems with other data formats. for example the BrainVision format has a binary datafile that ends with .eeg
... so using *.eeg
would also catch these. And they should truly never be empty.
I was thinking about:
const exceptionMiscs = ['BadChannels', 'bad.segments', '.eeg', 'BadChannels.bak', 'bad.segments.bak', '.eeg.bak']
would that not be enough?
Sorry, .eeg is only the extension. The actual file name for the CTF .eeg file is for example: sub-0001_ses-0001_task-rest_run-01_meg.ds/sub-0001_ses-0001_task-rest_run-01_meg.eeg It must have the same name as the .ds folder (as discussed in issue 692). So if not a wildcard, then the check would need to be more complex.
I'm curious, was there at some point a decision that the BIDS validator would also validate the raw data formats themselves? It can obviously be useful, but seems to me to go beyond the primary objective and would require a lot more effort considering the plethora of file formats.
I think it's kind of easy in the MRI world because they agreed upon nifti as the accepted format. It's then useful for the validator to check if the header of nifti agrees with the sidecar json files. There is also a library to read nifti files in javascript.
Unfortunately, in the case of EEG/iEEG/MEG, a single file format did not really happen. The code in the bids validator was originally written for MRI data. But now it's become a challenge because there aren't equivalent javascript readers for these data types. I agree that this is a big maintenance burden.
But wouldn't it be easy to just edit the .bidsignore? I can see why it can be useful to hardcode it in the code, but this could easily become a problem if we had more exceptions in other file formats ...
I'd rather suggest not enforcing this empty file rule by default. I think it's fine from my perspective that this tool would not fully validate all data formats in all modalities. But then we shouldn't have any constraints on those files. For those specific modalities or file formats that you wish to validate further, then the empty file check could be applied on those only, as appropriate for those formats. I don't think users should have to add valid files for any accepted format to .bidsignore since they don't actually conflict with the BIDS standard. Does that seem reasonable?
I see two separate issues here:
-
Should the validator ignore .bak files?
-
Should the validator raise an error for empty files by default with some exceptions or only check a selected set of extensions for content?
1.@Moo-Marc you should probably open a new issue for this. I don't think this extension should be added to the specification since .bak files are temporary and redundant in nature, but there is a precedence in the validator to ignore certain file types (such as hidden dot-files).
2.There are two aspects of this. First the number of file types where empty files are perfectly valid vs the number of file types where empty files are a sign of some sort of an issue. I believe there are far fewer files types where empty file is a meaningful valid file. Second - the engineering cost of switching the existing default to the opposite (as in someone needs to contribute and test this code). Considering those two options I would support adding .eeg
to the list of extensions that are allowed to be empty.
1. Should the validator ignore .bak files
Thinking again, I would say yes. The validator should not treat files inside .ds separately because it's a multifile system whose internals are not detailed in the specification. Anything inside the .ds folder may be considered raw data (?) It seems more reasonable that the non-zero size rule should be only for folder as a whole not to individual files. In MNE too, we point directly to the root of the directory, not to individual files. Taking this view would remove the necessity to hardcode exceptions (and thus potential for bugs). What do you think?
The validator should not treat files inside .ds separately because it's a multifile system whose internals are not detailed in the specification.
Are .bak
files only expected to be in .ds
folders?
It seems more reasonable that the non-zero size rule should be only for folder as a whole not to individual files.
If you are talking about any folder and any file I disagree. An empty file such as sub-01_T1w.nii.gz
is definitely a sign of something going wrong with the dataset preparation. The current does catch such situation. What I do find sensible is to ignore the contents of .ds
folders.
I don't think users should have to add valid files for any accepted format to .bidsignore since they don't actually conflict with the BIDS standard. Does that seem reasonable?
yes, sounds reasonable. I would not suggest to resolve the present issue through .bidsignore
alone
What I do find sensible is to ignore the contents of
.ds
folders.
I agree with this. That would solve our CTF specific problems.
A separate suggestion that I'd find sensible is to make the EMPTYFILE error a warning
instead of an error.
One of these two suggestions could be a solution. I am in favor of ignoring the CTF .ds
folder until somebody comes along and implements a CTF-reader
in Node/JS, which can then take over the checking load :-)
Are
.bak
files only expected to be in.ds
folders?
Yes for CTF datasets. I agree with everything in the last post. If someone implements a CTF validator, then it could be used later on. I'll mention here that I have a Matlab CTF to BIDS converter, but I'm unfamiliar with JS.
If you are talking about any folder and any file I disagree. An empty file such as
sub-01_T1w.nii.gz
is definitely a sign of something going wrong with the dataset preparation. The current does catch such situation.
Sorry for not being clear. I meant specifically in the case of .ds folders. It's a bit like your .nii.gz file. One does not unzip them and check the size of the files inside
until somebody comes along and implements a
CTF-reader
in Node/JS
let's write a fif reader and an edf reader first for which code exists :)