bids-validator
Support for parquet format and/or dataframe access
Related to https://github.com/bids-standard/bids-specification/issues/1792, which is on my mind again because of https://hbcd-docs.readthedocs.io/data_access/dataformats/tabulated/:
> Note: Parquet Not Currently Supported by BIDS
>
> Please note that Parquet files are not officially supported by the BIDS specification. For NBDC datasets, we decided to add Parquet as an alternative file format to the BIDS standard TSV to allow users to take advantage of the features of this modern and efficient open source format that is commonly used in the data science community.
A large project like HBCD adopting parquet in addition to BIDS seems like an indication that this is a recognized hole in BIDS, and so I think #1792 is likely to move forward. The validator could be the biggest sticking point, so I want to get out ahead of it.
Potentially relevant projects:
- hyparquet: A pure-javascript library that may have good browser support and <10KiB increase in the payload.
- parquet-wasm: wasm bindings to the Rust `parquet` and `arrow` implementations.
- apache-arrow: A library for working with parquet's underlying memory model (arrow). May also be useful for loading TSVs to the same data structures, ensuring unified treatment if we do add parquet support.
Via @rwblair: hyparquet seems to have an easier-to-use API.
As I look at it, I don't think attempting to unify parquet and TSV data is going to be desirable, so there will need to be branching logic on validation as well as load.
On the plus side, validating column types should be easy without looking at column contents in many cases. Even columns with Levels or enumerated values can be stored as dictionary arrays, allowing us to validate the type only.
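A rough sketch of what type-only validation could look like, assuming we already have the leaf columns of the file's schema in hand. `ColumnSchema`, `ExpectedKind`, `ALLOWED` and `checkColumnTypes` are hypothetical names, not existing validator or hyparquet APIs. The physical type names are the ones defined by the Parquet format, and the mapping from expected kinds to physical types is only a guess at what we would want:

```typescript
// Minimal stand-in for a leaf entry in a Parquet schema. Field names follow
// Parquet's SchemaElement struct; the interface itself is an assumption.
interface ColumnSchema {
  name: string
  type: string // physical type, e.g. 'INT64', 'DOUBLE', 'BYTE_ARRAY'
  converted_type?: string // e.g. 'UTF8' for strings
}

// Hypothetical expected kinds, loosely modeled on TSV column definitions.
type ExpectedKind = 'integer' | 'number' | 'string'

// Assumed mapping from expected kind to acceptable Parquet physical types.
const ALLOWED: Record<ExpectedKind, string[]> = {
  integer: ['INT32', 'INT64'],
  number: ['INT32', 'INT64', 'FLOAT', 'DOUBLE'],
  string: ['BYTE_ARRAY'],
}

// Compare declared column types against expectations without reading any rows.
function checkColumnTypes(
  schema: ColumnSchema[],
  expected: Record<string, ExpectedKind>,
): string[] {
  const issues: string[] = []
  const byName = new Map(schema.map((col) => [col.name, col]))
  for (const [name, kind] of Object.entries(expected)) {
    const col = byName.get(name)
    if (!col) {
      issues.push(`${name}: column missing`)
    } else if (!ALLOWED[kind].includes(col.type)) {
      issues.push(`${name}: expected ${kind}, found ${col.type}`)
    } else if (kind === 'string' && col.converted_type !== 'UTF8') {
      // Assumption: writers annotate string columns with converted type UTF8.
      issues.push(`${name}: BYTE_ARRAY is not annotated as UTF8`)
    }
  }
  return issues
}
```

Something like `checkColumnTypes(schema, { participant_id: 'string', age: 'number' })` would then return an empty list for a conforming file, and the row data never needs to be decoded.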
One open question is how to pass files to hyparquet's readers. There are built-in functions for converting filenames and URLs to hyparquet's AsyncBuffer type using node's fs API, but they are pretty simple:
https://github.com/hyparam/hyparquet/blob/8c4c7456b4a6dc1c7e76b695de3f867cf26d343c/src/utils.js#L76-L164
We could probably adapt them / contribute back upstream adapters for the Streams API.
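For a sense of how small such an adapter might be, here is a sketch that wraps a random-access file handle. As I understand it, hyparquet's `AsyncBuffer` is an object with a `byteLength` and a `slice(start, end)` that may return a promise. The shape is restated locally as `AsyncBufferLike` so the example has no dependencies. `RandomAccessFile` and `asyncBufferFromReader` are hypothetical names, and the `readBytes(size, offset)` signature is an assumption about what our file abstraction would expose:

```typescript
// Local restatement of the AsyncBuffer shape (assumption: byteLength plus an
// async slice), so this sketch does not import hyparquet.
interface AsyncBufferLike {
  byteLength: number
  slice(start: number, end?: number): Promise<ArrayBuffer>
}

// Hypothetical random-access file handle; not an existing validator type.
interface RandomAccessFile {
  size: number
  readBytes(size: number, offset?: number): Promise<Uint8Array>
}

function asyncBufferFromReader(file: RandomAccessFile): AsyncBufferLike {
  return {
    byteLength: file.size,
    async slice(start: number, end: number = file.size): Promise<ArrayBuffer> {
      // Assumes non-negative offsets; clamp the range to the file bounds.
      const from = Math.min(Math.max(start, 0), file.size)
      const to = Math.min(Math.max(end, from), file.size)
      const bytes = await file.readBytes(to - from, from)
      // Copy into a fresh ArrayBuffer so callers never see a larger backing buffer.
      const out = new ArrayBuffer(bytes.byteLength)
      new Uint8Array(out).set(bytes)
      return out
    },
  }
}
```

Only the requested byte ranges are read, so the footer and column chunks can be fetched without loading the whole file. A pure stream source without random access would need to be buffered first.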