Versioning interaction of bids-validator and OpenNeuro
I would like to open a discussion about bids-validator versioning and its impact on data providers/data analysts, particularly with respect to OpenNeuro. (This may not be the correct forum, but I would be interested in your thoughts.) Here is my understanding of the current situation:
1. As the bids-validator transitions to validation using schemas, more things are being checked, and datasets that previously validated may stop being valid. (Note: while this is annoying for the data set owners, it is a good thing for analysts who need to rely on datasets actually following the specification.)
2. OpenNeuro has become the go-to place for depositing BIDS datasets. The version of the bids-validator that OpenNeuro uses is typically a couple of weeks behind the current release of the bids-validator. (Note: this is also a good thing, since it makes sure no errors show up in the validator before they affect OpenNeuro, but it can be annoying to data set owners who have just validated their data using the latest release of the bids-validator, only to find that their data set is rejected by OpenNeuro.) A sketch of how to run a pinned validator version locally follows this list.
3. When a new release of the bids-validator is accepted on OpenNeuro, OpenNeuro slowly progresses through the deposited datasets and re-validates them. Thus, a data set that once validated now shows errors. (Note: this is a good thing for data consumers, who want to be sure that the data they are using is correct. However, it is annoying to the data set owners if they happen to look at their data set after it has been deposited. It is also confusing for data consumers who wonder if they should try to download this "erroneous" data.)
4. Some of the errors that pop up during the OpenNeuro revalidation process are very minor. The OpenNeuro staff may go in and fix a comma or two. (Note: this is a problem for data owners, who think the version they uploaded to OpenNeuro is actually the data that is there; they aren't notified of the change and so have no way of knowing. A similar problem occurs for data consumers who have already downloaded the data, run into a problem, and don't know why the copy on OpenNeuro is different from the one they have stored locally.)
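One practical workaround for the version lag described above is for a data set owner to pin the validator version locally, so that local checks correspond to a known validator release rather than whatever the latest release happens to be. Below is a minimal sketch, assuming the legacy npm validator can be pinned via npx and that the schema-based validator is run under Deno; the version number is only a placeholder (not necessarily what OpenNeuro currently deploys), and the exact invocations should be checked against the bids-validator README for the release in question:

```bash
# Legacy (npm) bids-validator: pin an explicit release so local results
# correspond to a known validator version ("1.14.0" is a placeholder,
# not necessarily the version OpenNeuro currently runs).
npx bids-validator@1.14.0 /path/to/my_dataset

# Schema-based validator (runs under Deno); the invocation may change
# between releases, so treat this as an approximate example.
deno run -A jsr:@bids/validator /path/to/my_dataset
```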
PROPOSAL:
- Make the version of the bids-validator that OpenNeuro uses very visible on the website (maybe in the FAQ).
- When a data set is deposited on OpenNeuro, put a line on the data set's page saying what version of the bids-validator was used to originally validate the data.
- Make no changes to the deposited data, however trivial, without putting a note on the data set's web page indicating what changes were made to which files. Create a new release of the data on OpenNeuro if the change was significant.
- Where possible, contact the data authors if a change is needed to remove errors.
- On the bids-validator side, as authors submit pull requests, maybe they could include a statement about how the PR affects validation of data. This might be used to provide slightly more comprehensive release notes that are expressed more in terms of what the changes imply for the users of the validator.
> 3. When a new release of the bids-validator is accepted on OpenNeuro, OpenNeuro slowly progresses through the deposited datasets and **re-validates** them. Thus, a data set that once validated now shows errors. (Note: this is a good thing for data consumers, who want to be sure that the data they are using is correct. However, it is annoying to the data set owners if they happen to look at their data set after it has been deposited. It is also confusing for data consumers who wonder if they should try to download this "erroneous" data.)
To clarify here, OpenNeuro does not currently revalidate automatically. New changes are validated with the bids-validator release deployed with OpenNeuro, but existing versions and snapshots are not revalidated. Recently we did a one-time run re-validating only the most recently created snapshots, to improve discoverability for new search features that depend on validation output. Otherwise, revalidation is sometimes done in response to validator issues affecting a specific dataset (such as validator crashes that were resolved in a later release). I don't think we'll automate rerunning it, but we would like to give authors an obvious and easy way to do so.
One other idea is that we could show the output from each version: we do keep the runs from each version of the validator.
> Make the version of the bids-validator that OpenNeuro uses very visible on the website (maybe in the FAQ). When a data set is deposited on OpenNeuro, put a line on the data set's page saying what version of the bids-validator was used to originally validate the data.
Long overdue to include this; here's a ticket describing how we plan for it to work: https://github.com/OpenNeuroOrg/openneuro/issues/1491