[Core feature] [Flytekit] Add support for HDF5 and Arrow in flyteplugins-vaex

Open ryankarlos opened this issue 2 years ago • 6 comments

Motivation: Why do you think this is important?

Currently, flyteplugins-vaex supports automatic serialization and deserialization of vaex DataFrames between consecutive tasks using Parquet. See https://github.com/flyteorg/flytekit/pull/1230

It would be good to extend this to HDF5 and Arrow, for performance and interoperability when datasets are too large to fit into memory. See https://vaex.readthedocs.io/en/latest/faq.html#What-is-the-optimal-file-format-to-use-with-vaex

Goal: What should the final outcome look like, ideally?

Register two extra handlers, VaexDataFrameToHDF5EncodingHandler and VaexDataFrameToArrowEncodingHandler, so that users can override the default format with Annotated:

@task
def t1(f: vaex.dataframe.DataFrameLocal) -> Annotated[StructuredDataset, HDF5]:
    ...

@task
def t2(f: vaex.dataframe.DataFrameLocal) -> Annotated[StructuredDataset, Arrow]:
    ...
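
To make the intended behaviour concrete, here is a minimal stdlib-only sketch of how a format marker in Annotated could select one of the registered handlers. It is not flytekit's actual dispatch code: StructuredDataset and VaexFrame are empty stand-in classes, the string markers, the ENCODERS registry, and the helper functions are all hypothetical, and the Parquet handler name is assumed by analogy with the two names proposed above.

```python
from typing import Annotated, Dict, Tuple, get_args, get_origin, get_type_hints


# Hypothetical stand-ins so the sketch runs without flytekit or vaex installed.
class StructuredDataset:
    pass


class VaexFrame:
    pass


# Hypothetical format markers; the real plugin may represent formats differently.
PARQUET, HDF5, ARROW = "parquet", "hdf5", "arrow"
DEFAULT_FORMAT = PARQUET

# Registry mapping (dataframe type, format) to a handler name.
ENCODERS: Dict[Tuple[type, str], str] = {}


def register(df_type: type, fmt: str, handler_name: str) -> None:
    """Record which handler encodes df_type into fmt."""
    ENCODERS[(df_type, fmt)] = handler_name


# The Parquet handler name is an assumption; the other two come from this issue.
register(VaexFrame, PARQUET, "VaexDataFrameToParquetEncodingHandler")
register(VaexFrame, HDF5, "VaexDataFrameToHDF5EncodingHandler")
register(VaexFrame, ARROW, "VaexDataFrameToArrowEncodingHandler")


def requested_format(fn) -> str:
    """Read the format marker from an Annotated return type, else use the default."""
    ret = get_type_hints(fn, include_extras=True).get("return")
    if get_origin(ret) is Annotated:
        # get_args yields (base_type, *metadata); look for a known format marker.
        for meta in get_args(ret)[1:]:
            if meta in (PARQUET, HDF5, ARROW):
                return meta
    return DEFAULT_FORMAT


def pick_encoder(fn, df_type: type) -> str:
    """Look up the handler registered for the task's requested format."""
    return ENCODERS[(df_type, requested_format(fn))]


def t1(f: VaexFrame) -> Annotated[StructuredDataset, HDF5]: ...
def t2(f: VaexFrame) -> Annotated[StructuredDataset, ARROW]: ...
def t3(f: VaexFrame) -> StructuredDataset: ...  # no annotation: default format


print(pick_encoder(t1, VaexFrame))
print(pick_encoder(t2, VaexFrame))
print(pick_encoder(t3, VaexFrame))
```

The point of the sketch is only that an unannotated return type keeps the existing Parquet behaviour, while the annotation opts a single task into HDF5 or Arrow.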

Describe alternatives you've considered

N/A

Propose: Link/Inline OR Additional context

See the discussion thread: https://github.com/flyteorg/flytekit/pull/1230#discussion_r1006645274

Are you sure this issue hasn't been raised already?

  • [X] Yes

Have you read the Code of Conduct?

  • [X] Yes

ryankarlos · Oct 28 '22 22:10