Tracking Parquet follow up work
Here is a list of the Parquet follow-up work that has been identified so far; items will be checked off as they are merged into master in follow-up PRs. Please speak up if you have an opinion on prioritization or if any additional work has been left out of this list.
- [x] Better error handling (added in https://github.com/Bears-R-Us/arkouda/pull/993)
- [x] Improve performance for reading
- [x] batch reading optimization (added in https://github.com/Bears-R-Us/arkouda/pull/1014)
- [x] parallel file reading https://github.com/Bears-R-Us/arkouda/pull/1050
- [ ] Improve write performance
- [x] batch writing https://github.com/Bears-R-Us/arkouda/pull/1028
- [ ] parallel file writing
- [x] Writing with Snappy/RLE https://github.com/Bears-R-Us/arkouda/pull/1104
- [x] Extend to support more reading of Arrow types to pdarrays
- [x] bool
- [x] float
- [x] uint32 (just need to thread through server code, C++ already has this)
- [x] uint64 https://github.com/Bears-R-Us/arkouda/pull/1070
- [x] timestamps https://github.com/Bears-R-Us/arkouda/pull/1024
- [x] Add capability for string reading
- [x] Support append mode
- [x] strings
- [x] other types
- [x] Auto detect Parquet/HDF5 file type at runtime
- [ ] Auto detect arrow/parquet support at build time
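The runtime file-type detection item above can be done by inspecting magic bytes: Parquet files begin (and end) with the 4-byte magic `PAR1`, and HDF5 files begin with the 8-byte signature `\x89HDF\r\n\x1a\n`. A minimal sketch of that idea (the function name and return values are illustrative assumptions, not arkouda's actual implementation):

```python
def detect_format(header: bytes) -> str:
    """Classify a file from its leading bytes.

    Parquet's magic is the 4 bytes b"PAR1"; HDF5's superblock
    signature is the 8 bytes b"\\x89HDF\\r\\n\\x1a\\n".
    """
    if header[:4] == b"PAR1":
        return "parquet"
    if header[:8] == b"\x89HDF\r\n\x1a\n":
        return "hdf5"
    return "unknown"

# In practice the server would read the first 8 bytes of the file
# and dispatch to the appropriate reader based on the result.
print(detect_format(b"PAR1" + b"\x00" * 12))      # a Parquet header
print(detect_format(b"\x89HDF\r\n\x1a\n"))        # an HDF5 header
```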
A couple things to add:
- [x] Extend `ak.get_datasets()` to Parquet, or add an analogous function to list the columns available for reading from a set of Parquet files. I think by default it should only return columns that arkouda can actually read, but maybe with a keyword arg to return all existing columns?
- [x] If no column/dset names are passed to `ak.read_parquet()`, read all supported columns by default
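The proposed listing behavior above (return only readable columns by default, with a keyword to return everything) can be sketched as plain Python. The function name, the `allow_unsupported` keyword, the supported-dtype set, and the schema dict are all hypothetical, chosen only to illustrate the design choice:

```python
# Illustrative set of dtypes the server can read; not arkouda's real list.
SUPPORTED_DTYPES = {"int64", "uint64", "float64", "bool", "str"}

def get_parquet_datasets(schema, allow_unsupported=False):
    """Return column names from a Parquet schema.

    schema maps column name -> dtype string (as read from file metadata).
    By default, filter to columns arkouda could actually read; pass
    allow_unsupported=True to list every column in the file.
    """
    if allow_unsupported:
        return list(schema)
    return [name for name, dtype in schema.items() if dtype in SUPPORTED_DTYPES]

schema = {"a": "int64", "b": "decimal128", "c": "str"}
print(get_parquet_datasets(schema))                          # readable columns only
print(get_parquet_datasets(schema, allow_unsupported=True))  # every column
```

Filtering by default keeps the common path safe (a subsequent read of every listed column will succeed), while the keyword still lets users discover columns that exist but are not yet supported.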
Rough priorities on the remaining components:
- Reading strings
- Support for `get_datasets` or equivalent
- Read all supported columns by default
- Reading float and bool
- Append mode for writing
- Auto-detect arrow/parquet
Longer-term or speculative elements (these should not displace the above, but we should keep them in the back of our minds):
- API consolidation and standardization across HDF5 and Parquet
- Support for Parquet columns with nested types (like `List`), potentially mapping to `akutil.SegArray`
@bmcdonald3 is the "Writing with Snappy/RLE" requirement satisfied by https://github.com/Bears-R-Us/arkouda/pull/1104?
@pierce314159 Yup, I've updated that; thanks.
Pierce caught today that we aren't handling `uint32` completely in the Parquet code, so that should be added as well.
Closing this. #2133 has been added to track parallel writes, and we are now always building with Parquet, so we do not need build-time detection of Parquet support.