hyparquet
Query API
This is where we'll discuss adding a Query API that lets users efficiently retrieve the data they're looking for. Users can perform predicate pushdown manually at the column chunk and page level using the column and offset indexes, but this is labor-intensive and requires some knowledge of how Parquet files work.
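To make the manual approach concrete, here is a minimal sketch of page-level pruning from min/max statistics. The `pages` objects and the `pagesForRange` helper are hypothetical and only mimic the kind of information a column index and offset index provide; they are not hyparquet's actual types or API.

```javascript
// Hypothetical per-page statistics, shaped like what a column index plus
// offset index could provide. Illustrative only, not hyparquet's types.
const pages = [
  { firstRow: 0, rowCount: 100, min: 1, max: 50 },
  { firstRow: 100, rowCount: 100, min: 51, max: 120 },
  { firstRow: 200, rowCount: 100, min: 121, max: 300 },
]

/**
 * Return the pages that may contain a value in the inclusive range [lo, hi].
 * A page is skipped only when its [min, max] interval cannot overlap the
 * query range, so pruning never drops matching rows.
 */
function pagesForRange(pages, lo, hi) {
  return pages.filter(p => p.max >= lo && p.min <= hi)
}

// Query: 60 <= value <= 130. The first page (max 50) is pruned.
const selected = pagesForRange(pages, 60, 130)
console.log(selected.map(p => p.firstRow)) // [ 100, 200 ]
```

Rows in the surviving pages still need to be filtered after decoding, since statistics only prove that a page cannot match, never that every row in it does.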
Ideally, users should be able to write simple queries that are analyzed and used to construct efficient query plans or predicate functions, which are then evaluated against column chunk and page statistics. We could also try to optimize data fetching, especially over the network, by making multi-range requests (when supported or specified via a flag) and merging requests for adjacent or nearly adjacent byte ranges.
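The fetch optimization could look roughly like the sketch below. `coalesceRanges`, `toRangeHeader`, and the `maxGap` parameter are hypothetical names chosen for illustration, and ranges are assumed to be half-open `[start, end)` byte offsets. The header format follows the standard HTTP `Range` syntax, where end offsets are inclusive.

```javascript
/**
 * Merge byte ranges that overlap or are separated by at most maxGap bytes.
 * Ranges are half-open: { start, end } covers bytes start..end-1.
 */
function coalesceRanges(ranges, maxGap = 0) {
  const sorted = [...ranges].sort((a, b) => a.start - b.start)
  const merged = []
  for (const r of sorted) {
    const last = merged[merged.length - 1]
    if (last && r.start - last.end <= maxGap) {
      // Close enough to the previous range: extend it rather than
      // issuing a separate request.
      last.end = Math.max(last.end, r.end)
    } else {
      merged.push({ start: r.start, end: r.end })
    }
  }
  return merged
}

/** Build a multi-range HTTP Range header value (inclusive end offsets). */
function toRangeHeader(ranges) {
  return 'bytes=' + ranges.map(r => `${r.start}-${r.end - 1}`).join(',')
}

const wanted = [
  { start: 1000, end: 1200 },
  { start: 0, end: 100 },
  { start: 110, end: 200 },
]
const merged = coalesceRanges(wanted, 16)
console.log(toRangeHeader(merged)) // bytes=0-199,1000-1199
```

Merging nearly adjacent ranges trades a few wasted bytes for fewer round trips. Since not every server honors multi-range requests, a fallback to one request per merged range would likely be needed.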
User-defined predicate functions are another option (potentially just a lower-level API) and could be implemented with fairly little effort. They would allow users to define arbitrary predicate logic and would also be a valid basis for implementing different query frontends (e.g. the simple structured queries discussed earlier).
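As a rough sketch of how the two layers might relate, the example below treats a predicate as a function from per-column statistics to a boolean ("might this row group or page match?") and compiles a simple structured query into such a function. The query shape `{ column, op, value }`, the `compileQuery` helper, and the statistics objects are all assumptions made for illustration, not an existing or proposed hyparquet API.

```javascript
// Hypothetical statistics for two row groups, keyed by column name.
const rowGroups = [
  { id: 0, stats: { price: { min: 5, max: 40 } } },
  { id: 1, stats: { price: { min: 35, max: 90 } } },
  { id: 2, stats: {} }, // no statistics recorded for this row group
]

/**
 * Compile a structured query into a statistics predicate.
 * The predicate returns false only when the statistics prove that
 * no row can match; missing statistics are treated as "may match".
 */
function compileQuery({ column, op, value }) {
  const tests = new Map([
    ['eq', s => s.min <= value && value <= s.max],
    ['gt', s => s.max > value],
    ['lt', s => s.min < value],
  ])
  const test = tests.get(op)
  if (!test) throw new Error(`unsupported operator: ${op}`)
  return stats => {
    const s = stats[column]
    return s === undefined ? true : test(s)
  }
}

// Frontend: a simple structured query.
const structured = compileQuery({ column: 'price', op: 'gt', value: 50 })

// Lower-level API: an arbitrary user-defined predicate with the same shape.
const custom = stats =>
  stats.price === undefined || stats.price.max - stats.price.min > 50

console.log(rowGroups.filter(g => structured(g.stats)).map(g => g.id)) // [ 1, 2 ]
console.log(rowGroups.filter(g => custom(g.stats)).map(g => g.id)) // [ 1, 2 ]
```

Because both forms reduce to the same function signature, the scan logic only has to understand predicates, and additional query frontends can be layered on top without touching it.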