Tracking Parquet follow up work

Open bmcdonald3 opened this issue 3 years ago • 5 comments

Here is a list of the Parquet follow-up work that has been identified so far. Items will be checked off as they are merged into master in follow-up PRs. Please speak up if you have an opinion on prioritization or have identified any additional work that is missing from this list.

  • [x] Better error handling (added in https://github.com/Bears-R-Us/arkouda/pull/993)
  • [x] Improve performance for reading
    • [x] batch reading optimization (added in https://github.com/Bears-R-Us/arkouda/pull/1014)
    • [x] parallel file reading https://github.com/Bears-R-Us/arkouda/pull/1050
  • [ ] Improve write performance
    • [x] batch writing https://github.com/Bears-R-Us/arkouda/pull/1028
    • [ ] parallel file writing
  • [x] Writing with Snappy/RLE https://github.com/Bears-R-Us/arkouda/pull/1104
  • [x] Extend to support more reading of Arrow types to pdarrays
    • [x] bool
    • [x] float
    • [x] uint32 (just needs to be threaded through the server code; the C++ side already supports it)
    • [x] uint64 https://github.com/Bears-R-Us/arkouda/pull/1070
    • [x] timestamps https://github.com/Bears-R-Us/arkouda/pull/1024
  • [x] Add capability for string reading
  • [x] Support append mode
    • [x] strings
    • [x] other types
  • [x] Auto detect Parquet/HDF5 file type at runtime
  • [ ] Auto detect arrow/parquet support at build time
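
For reference, one common way to tell the two formats apart at runtime is to check the leading bytes of the file. The following is a minimal sketch in Python, not the server's actual implementation. The function names are hypothetical. It relies on two format facts: Parquet files start with the 4-byte magic `PAR1`, and HDF5 files start with the 8-byte signature `\x89HDF\r\n\x1a\n`. It only checks offset 0, so an HDF5 file with a user block (signature at a later offset) would be reported as unknown.

```python
PARQUET_MAGIC = b"PAR1"
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"


def detect_format_from_header(header: bytes) -> str:
    """Classify a file from its first bytes (hypothetical helper)."""
    if header.startswith(HDF5_MAGIC):
        return "hdf5"
    if header.startswith(PARQUET_MAGIC):
        return "parquet"
    return "unknown"


def detect_file_format(path: str) -> str:
    """Read the first 8 bytes of the file at `path` and classify them."""
    with open(path, "rb") as f:
        return detect_format_from_header(f.read(len(HDF5_MAGIC)))
```

Keeping the byte check separate from the file I/O makes it easy to apply the same logic to a buffer that has already been read.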

bmcdonald3 avatar Nov 30 '21 21:11 bmcdonald3

A couple things to add:

  • [x] Extend ak.get_datasets() to parquet, or add an analogous function to list the columns available for reading from a set of parquet files. I think it should by default return only the columns that arkouda can actually read, perhaps with a keyword arg to return all existing columns.
  • [x] If no column/dset names are passed to ak.read_parquet(), read all supported columns by default
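
To make the two items above concrete, here is a rough sketch of the intended behavior in plain Python. Everything in it is an assumption for illustration: the schema is modeled as a dict of column name to type name, the `SUPPORTED_TYPES` set is made up, and `list_columns` / `resolve_columns` are hypothetical names rather than arkouda functions.

```python
# Illustrative type names only; not arkouda's actual list of supported types.
SUPPORTED_TYPES = {"int64", "uint64", "uint32", "float64", "bool", "string", "timestamp"}


def list_columns(schema, include_unsupported=False):
    """List column names from `schema` (dict of column name -> type name).

    By default only columns with a supported type are returned.
    """
    if include_unsupported:
        return list(schema)
    return [name for name, typ in schema.items() if typ in SUPPORTED_TYPES]


def resolve_columns(schema, requested=None):
    """Pick the columns to read: all supported ones if none are requested."""
    if requested is None:
        return list_columns(schema)
    # A requested column that is missing or has an unsupported type is an error.
    unreadable = [c for c in requested if schema.get(c) not in SUPPORTED_TYPES]
    if unreadable:
        raise ValueError(f"cannot read columns: {unreadable}")
    return list(requested)
```

With a schema such as `{"id": "int64", "name": "string", "tags": "list<int64>"}`, the default listing would skip `tags`, while the keyword argument would include it.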

reuster986 avatar Feb 14 '22 15:02 reuster986

Rough priorities on the remaining components:

  1. Reading strings
  2. Support for get_datasets or equivalent
  3. Read all supported columns by default
  4. Reading float and bool
  5. Append mode for writing
  6. Auto-detect arrow/parquet

Longer-term or speculative elements (should not displace the above, but we should keep them in the back of our minds):

  1. API consolidation and standardization across HDF5 and parquet
  2. Support for parquet columns with nested types (like List), potentially mapping to akutil.SegArray.
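
As a rough illustration of the second item, a List column can be flattened into a segments array (the start offset of each row) plus a single values array, which is the general shape a segmented array takes. This is a plain-Python sketch under that assumption, with a hypothetical function name, and it does not use the akutil.SegArray API.

```python
def to_segments_and_values(rows):
    """Flatten a list of lists into (segments, values).

    segments[i] is the offset in `values` where row i starts;
    an empty row shares its offset with the following row.
    """
    segments = []
    values = []
    for row in rows:
        segments.append(len(values))
        values.extend(row)
    return segments, values
```

Row lengths can be recovered from consecutive offsets, with the last row running to the end of `values`.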

reuster986 avatar Feb 16 '22 18:02 reuster986

@bmcdonald3 is the "Writing with Snappy/RLE" requirement satisfied by https://github.com/Bears-R-Us/arkouda/pull/1104?

stress-tess avatar Feb 28 '22 18:02 stress-tess

@pierce314159 yup, I've updated that; thanks.

bmcdonald3 avatar Feb 28 '22 18:02 bmcdonald3

Pierce caught today that we aren't handling uint32 completely in the Parquet code, so that should be added as well.

bmcdonald3 avatar Mar 01 '22 00:03 bmcdonald3

Closing this. #2133 has been added to track parallel writes, and we are always building with Parquet, so we do not need detection of support at build time.

Ethan-DeBandi99 avatar Feb 09 '23 16:02 Ethan-DeBandi99