
Support for parquet format and/or dataframe access

Open effigies opened this issue 7 months ago • 2 comments

Related to https://github.com/bids-standard/bids-specification/issues/1792, which is on my mind again because of https://hbcd-docs.readthedocs.io/data_access/dataformats/tabulated/:

Note: Parquet Not Currently Supported by BIDS

Please note that Parquet files are not officially supported by the BIDS specification. For NBDC datasets, we decided to add Parquet as an alternative file format to the BIDS standard TSV to allow users to take advantage of the features of this modern and efficient open source format that is commonly used in the data science community.

A large project like HBCD adopting parquet in addition to BIDS seems like an indication that this is a recognized hole in BIDS, and so I think #1792 is likely to move forward. The validator could be the biggest sticking point, so I want to get out ahead of it.

Potentially relevant projects:

  • hyparquet: A pure-javascript library that may have good browser support and <10KiB increase in the payload.
  • parquet-wasm: wasm bindings to the Rust parquet and arrow implementations.
  • apache-arrow: A library for working with parquet's underlying memory model (arrow). May also be useful for loading TSVs to the same data structures, ensuring unified treatment if we do add parquet support.

effigies avatar Apr 04 '25 14:04 effigies

Via @rwblair: hyparquet seems to have an easier-to-use API.

As I look at it, I don't think attempting to unify parquet and TSV data is going to be desirable, so there will need to be branching logic on validation as well as load.

On the plus side, validating column types should be easy without looking at column contents in many cases. Even columns with Levels or enumerated values can be stored as dictionary arrays, allowing us to validate the type only.
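
A rough sketch of what that branching could look like, under stated assumptions. Every name here (`ParquetColumn`, `ExpectedType`, `checkParquetColumn`, `checkTsvColumn`) is hypothetical and not part of the validator or of hyparquet, and the type-name table is a simplification rather than a full mapping of parquet types. The point is only that the parquet branch can decide from the schema, while the TSV branch has to look at every cell.

```typescript
// All names below are hypothetical; this is not the validator's real API.

// Simplified stand-in for one column of a parquet schema, as read from
// file metadata without touching the column data.
interface ParquetColumn {
  name: string
  // Simplified type name, e.g. 'INT64', 'DOUBLE', 'STRING' (assumed labels)
  type: string
}

// Types a sidecar definition might require of a column (assumed set).
type ExpectedType = 'integer' | 'number' | 'string'

// Which schema types satisfy which expectation (illustrative, not exhaustive).
const COMPATIBLE: Record<ExpectedType, string[]> = {
  integer: ['INT32', 'INT64'],
  number: ['INT32', 'INT64', 'FLOAT', 'DOUBLE'],
  string: ['STRING'],
}

// Parquet branch: the schema alone is enough to accept or reject the column.
function checkParquetColumn(col: ParquetColumn, expected: ExpectedType): boolean {
  return COMPATIBLE[expected].includes(col.type)
}

// TSV branch: every cell is text, so each value has to be inspected.
function checkTsvColumn(values: string[], expected: ExpectedType): boolean {
  return values.every((v) => {
    if (v === 'n/a') return true // BIDS missing-value marker
    if (expected === 'string') return true
    if (v.trim() === '') return false
    const n = Number(v)
    if (Number.isNaN(n)) return false
    return expected === 'integer' ? Number.isInteger(n) : true
  })
}
```

For example, `checkParquetColumn({ name: 'age', type: 'INT64' }, 'integer')` passes without reading any rows, whereas `checkTsvColumn(['1', '2', 'n/a'], 'integer')` has to walk the values.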

effigies avatar Apr 04 '25 16:04 effigies

One open question is how to pass files to hyparquet's readers. There are built-in functions for converting filenames and URLs to hyparquet's AsyncBuffer type using Node's fs API, but they are pretty simple:

https://github.com/hyparam/hyparquet/blob/8c4c7456b4a6dc1c7e76b695de3f867cf26d343c/src/utils.js#L76-L164

We could probably adapt them, or contribute adapters for the Streams API back upstream.
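
To make the adapter idea concrete, here is a minimal sketch. `AsyncBufferLike` is my local approximation of hyparquet's AsyncBuffer shape (a byte length plus a range-reading `slice`), so check it against hyparquet's own type definitions before relying on it. `asyncBufferFromBlob` is a hypothetical helper, not something hyparquet ships. It wraps a web `Blob`/`File` rather than a raw stream, because parquet keeps its metadata in a footer at the end of the file and so needs random access; a plain `ReadableStream` would have to be buffered first.

```typescript
// Local approximation of hyparquet's AsyncBuffer shape (assumption:
// verify against the library's type definitions).
interface AsyncBufferLike {
  byteLength: number
  slice(start: number, end?: number): Promise<ArrayBuffer>
}

// Hypothetical adapter: expose a Blob or File as an on-demand byte-range reader.
// Nothing is read until slice() is called, and only the requested range is loaded.
function asyncBufferFromBlob(blob: Blob): AsyncBufferLike {
  return {
    byteLength: blob.size,
    slice: (start, end) => blob.slice(start, end).arrayBuffer(),
  }
}
```

Since browser `File` objects are Blobs, the same wrapper would cover files picked in the web validator, without going through Node's fs API.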

effigies avatar Apr 04 '25 16:04 effigies