arkouda Tracking Parquet follow up work

Here is a list of the Parquet follow-up work that has been identified so far. Items will be checked off as they are merged into master in follow-up PRs. Please speak up if you have an opinion on prioritization or have identified any additional work that has been left out of this list.

[x] Better error handling (added in https://github.com/Bears-R-Us/arkouda/pull/993)
[x] Improve performance for reading
- [x] batch reading optimization (added in https://github.com/Bears-R-Us/arkouda/pull/1014)
- [x] parallel file reading https://github.com/Bears-R-Us/arkouda/pull/1050
[ ] Improve write performance
- [x] batch writing https://github.com/Bears-R-Us/arkouda/pull/1028
- [ ] parallel file writing
[x] Writing with Snappy/RLE https://github.com/Bears-R-Us/arkouda/pull/1104
[x] Extend reading to support more Arrow types as pdarrays
- [x] bool
- [x] float
- [x] uint32 (just needs to be threaded through the server code; C++ already has this)
- [x] uint64 https://github.com/Bears-R-Us/arkouda/pull/1070
- [x] timestamps https://github.com/Bears-R-Us/arkouda/pull/1024
[x] Add capability for string reading
[x] Support append mode
- [x] strings
- [x] other types
[x] Auto detect Parquet/HDF5 file type at runtime
[ ] Auto detect arrow/parquet support at build time
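
For the runtime file type detection item above, one possible approach is to inspect magic bytes. This is a minimal sketch, not arkouda's actual implementation: `detect_file_type` is a hypothetical helper name. It assumes the standard signatures, namely that a Parquet file begins and ends with the 4 bytes `PAR1` and that an HDF5 file begins with the 8-byte signature `\x89HDF\r\n\x1a\n`.

```python
import os

PARQUET_MAGIC = b"PAR1"
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"


def detect_file_type(path):
    """Hypothetical helper: return 'parquet', 'hdf5', or 'unknown'.

    Limitation: an HDF5 file with a user block stores its signature at
    offset 512, 1024, ... instead of 0; this sketch only checks offset 0.
    """
    with open(path, "rb") as f:
        head = f.read(8)
        if head == HDF5_MAGIC:
            return "hdf5"
        if head[:4] == PARQUET_MAGIC:
            # A valid Parquet file also carries the magic in its last 4 bytes.
            size = f.seek(0, os.SEEK_END)
            if size >= 12:
                f.seek(-4, os.SEEK_END)
                if f.read(4) == PARQUET_MAGIC:
                    return "parquet"
    return "unknown"
```

Checking the bytes rather than the file extension means a misnamed file is still routed to the right reader, at the cost of one small extra read per file.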

Nov 30 '21 21:11 bmcdonald3

A couple things to add:

[x] Extend ak.get_datasets() to parquet, or add an analogous function to list the columns available for reading from a set of parquet files. I think by default only return columns that arkouda can actually read, but maybe with a keyword arg to return all existing columns?
[x] If no column/dset names are passed to ak.read_parquet(), read all supported columns by default
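
The semantics proposed in the two items above could look roughly like the sketch below. Everything here is hypothetical and for illustration only: `list_columns`, `columns_to_read`, the `all_columns` keyword, the `SUPPORTED_TYPES` set, and the plain dict standing in for a Parquet schema are assumptions, not arkouda's API.

```python
# Illustrative only: the set of types arkouda can actually read may differ.
SUPPORTED_TYPES = {"int64", "uint32", "uint64", "float64", "bool", "string", "timestamp"}


def list_columns(schema, all_columns=False):
    """Return readable column names, or every column if all_columns is True.

    `schema` is a stand-in mapping of column name -> type name.
    """
    if all_columns:
        return list(schema)
    return [name for name, typ in schema.items() if typ in SUPPORTED_TYPES]


def columns_to_read(schema, requested=None):
    """Default to all supported columns when no names are passed."""
    if requested is None:
        return list_columns(schema)
    missing = [name for name in requested if name not in schema]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return list(requested)


schema = {"id": "int64", "tags": "list<string>", "name": "string"}
print(list_columns(schema))                    # supported columns only
print(list_columns(schema, all_columns=True))  # every column in the files
print(columns_to_read(schema))                 # same as the supported default
```

Keeping the default restricted to supported columns means a read with no names passed never fails on a nested or otherwise unsupported type, while the keyword still lets users discover what else is in the files.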

Feb 14 '22 15:02 reuster986

Rough priorities on the remaining components:

Reading strings
Support for get_datasets or equivalent
Read all supported columns by default
Reading float and bool
Append mode for writing
Auto-detect arrow/parquet

Longer-term or speculative elements (should not displace the above, but we should keep them in the back of our minds):

API consolidation and standardization across HDF5 and parquet
Support for parquet columns with nested types (like List), potentially mapping to akutil.SegArray.

Feb 16 '22 18:02 reuster986

@bmcdonald3 is the "Writing with Snappy/RLE" requirement satisfied by https://github.com/Bears-R-Us/arkouda/pull/1104?

Feb 28 '22 18:02 stress-tess

@pierce314159 yup, I've updated that; thanks.

Feb 28 '22 18:02 bmcdonald3

Pierce caught today that we aren't handling uint32 completely in the Parquet code, so that should be added as well.

Mar 01 '22 00:03 bmcdonald3

Closing this. #2133 has been added to track parallel writes and we are always building with Parquet, so we do not need detection of support at build.

Feb 09 '23 16:02 Ethan-DeBandi99
