Allow arrow tables in read and write interface

Open fjetter opened this issue 6 years ago • 2 comments

We're using Apache Arrow as the ultimate tool to glue everything together. When writing data, we accept pandas DataFrames, convert them to Arrow tables and store them as Parquet. When reading data, it is the other way round. Some pipelines never need pandas at all and could process the data directly as Arrow tables.

As a user, I'd like to be able to pass Arrow tables directly to the kartothek pipeline to store them, and to have the option to read the data back directly as an Arrow table. Conversion to pandas should only happen when absolutely necessary. A current example of a necessary conversion is the partition_on feature, where we perform a groupby on the data.

Jun 03 '19 13:06 fjetter