
Why use callbacks instead of iterators?

Open domoritz opened this issue 8 months ago • 2 comments

I was wondering about the API design and why the reader uses callbacks instead of an iterator/stream interface?

domoritz, Jun 17 '25 16:06

It's been hard to find exactly the right API. Some people want a stream, others just want all the data when it's ready. Iterators are much slower if you're trying to just rip through a lot of data. And do you iterate by row or by row group? Iterator or AsyncIterator? Also, Parquet is a column-oriented format, so what if one column is ready before another column? Should the user have to wait for all columns to be ready to display the first data?

A few weeks ago I did a major refactor that helps with this. The new parquetReadAsync function returns an AsyncRowGroup[], which is essentially a map of all the ColumnChunks (as promises) needed to fulfill a query, so that you can reassemble the data in whatever order you want. https://github.com/hyparam/hyparquet/pull/83

function parquetReadAsync(options: ParquetReadOptions): AsyncRowGroup[]

interface AsyncRowGroup {
  groupStart: number // index of the first row in this rowgroup
  groupRows: number // number of rows in this rowgroup
  asyncColumns: AsyncColumn[]
}
interface AsyncColumn {
  pathInSchema: string[]
  data: Promise<DecodedArray[]>
}

It's fairly straightforward to await and convert an AsyncRowGroup into any of:

  • Promise<Record<string, any>[]> materialized rows
  • AsyncIterator<Record<string, any>> row iterator
  • AsyncIterator<Record<string, any>[]> row group iterator
  • AsyncIterator<DecodedArray> iterator over a single column without transposing to rows
  • etc
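
As a rough illustration of the first two conversions in the list above, here is a minimal sketch, not hyparquet's actual implementation. The interfaces are re-declared locally so the snippet stands alone, and DecodedArray is simplified to any array-like value. The helper names rowGroupToRows and iterateRows are hypothetical. The sketch also assumes flat (non-nested) columns, that each column's chunks concatenate to exactly groupRows values, and that joining pathInSchema with dots is an acceptable row key.

```typescript
// Local stand-ins for the types shown above (DecodedArray is simplified here).
type DecodedArray = ArrayLike<unknown>

interface AsyncColumn {
  pathInSchema: string[]
  data: Promise<DecodedArray[]>
}

interface AsyncRowGroup {
  groupStart: number
  groupRows: number
  asyncColumns: AsyncColumn[]
}

type Row = Record<string, unknown>

// Hypothetical helper: await every column in one row group,
// then transpose the column values into row objects.
async function rowGroupToRows(group: AsyncRowGroup): Promise<Row[]> {
  const columns = await Promise.all(
    group.asyncColumns.map(async col => {
      const chunks = await col.data
      // Concatenate the decoded chunks into one flat list of values.
      const values: unknown[] = []
      for (const chunk of chunks) {
        for (let i = 0; i < chunk.length; i++) values.push(chunk[i])
      }
      // Assumption: the dotted schema path is used as the row key.
      return { name: col.pathInSchema.join('.'), values }
    })
  )
  const rows: Row[] = []
  for (let i = 0; i < group.groupRows; i++) {
    const row: Row = {}
    for (const { name, values } of columns) row[name] = values[i]
    rows.push(row)
  }
  return rows
}

// Hypothetical helper: a row iterator over all row groups, in order.
// Each group is materialized only when the consumer reaches it.
async function* iterateRows(groups: AsyncRowGroup[]): AsyncGenerator<Row> {
  for (const group of groups) {
    const rows = await rowGroupToRows(group)
    for (const row of rows) yield row
  }
}
```

With these in place, `for await (const row of iterateRows(groups))` walks rows in file order, while a caller that cares about throughput can skip the transpose and await individual `asyncColumns[i].data` promises directly.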

I am starting to think about what API I would design for hyparquet v2. Any thoughts or requests?

platypii, Jun 17 '25 17:06

Thanks for the insights. I think a row-focused API that just makes it feel like iterating over an array of objects, for convenience, would be one. Having something familiar and convenient, albeit not fast, would be great for initial adoption.

And then something column-oriented for better performance, and something with incremental updates as things get loaded, makes sense.

I personally would love something like the APIs from Arrow JS and Flechette, which give you something that looks like an array of objects so that you can easily pass it into existing libraries like Vega.

domoritz, Jun 17 '25 18:06