feature: Arrow table input/output

Open judahrand opened this issue 2 years ago • 4 comments
Feature request

I think it would be great to add Arrow Tables as an IO type for BentoML endpoints. This would be particularly beneficial for the gRPC server, where the Arrow IPC format (not Parquet) could be used directly by placing the serialized data in the serialized_bytes field of the Protobuf message.

Motivation

Parquet is currently used to move Pandas DataFrames around in BentoML. It is a great storage format, but because it is designed as an on-disk format it doesn't maintain all of the great properties of the in-memory Arrow format, like strict register alignment. It may reduce on-the-wire data size but will almost certainly increase serialization/deserialization time.

I believe that this addition would:

  • reduce serialization/deserialization latency
  • allow for the easy use of other tools within the Arrow ecosystem (Polars, DataFusion, DuckDB, etc.)

Other

No response

judahrand avatar Aug 15 '23 08:08 judahrand

Hi @judahrand - we are working on a new iteration of IO Descriptor in BentoML and it will come with Arrow support! cc @frostming

parano avatar Oct 31 '23 17:10 parano

Does the code that's in development exist somewhere? I'd be interested in having a read.

judahrand avatar Oct 31 '23 18:10 judahrand

Does the code that's in development exist somewhere? I'd be interested in having a read.

Sure, #4240

frostming avatar Nov 01 '23 13:11 frostming

@parano Did Arrow support ever get added?

judahrand avatar Mar 05 '24 15:03 judahrand