
Question - Can `datafusion-python` be used without pyarrow?

Open matthewmturner opened this issue 3 years ago • 2 comments

I feel odd even asking this, but is it possible to make enhancements so that datafusion-python can be used without pyarrow? pyarrow is fantastic and I already use it, but it is fairly large, which makes it somewhat painful to deploy for some serverless use cases (such as on AWS Lambda). If I am able to do everything I need in datafusion, is there a need for pyarrow? I confess I'm not very familiar with the interface between Rust/datafusion and Python/Arrow, so hopefully this isn't too stupid of a question.

thx!

matthewmturner avatar Feb 12 '22 03:02 matthewmturner

I think it might be possible; a good portion of the module doesn't require PyArrow. The only things that do are UDFs, UDAFs, and the parts of the DataFrame API that return PyArrow data structures (like `collect()` and `schema()`). Does a datafusion-python without those features sound appealing?

wjones127 avatar Feb 20 '22 03:02 wjones127

Cool - that was what it looked like to me as well from my scan of the code. IMHO, in the medium term it would be nice to have pyarrow as an optional feature. I think datafusion should get some improvements on the IO front before enabling this, though (I'm looking into / working on writing capabilities: https://github.com/apache/arrow-datafusion/issues/1777). Right now I think pyarrow has more functionality there, which is useful.

matthewmturner avatar Feb 20 '22 05:02 matthewmturner
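
For readers wondering what "pyarrow as an optional feature" could look like on the Python side, here is a minimal sketch of the usual optional-dependency pattern. It is not how datafusion-python is implemented; `optional_import`, `require_pyarrow`, and `values_to_arrow` are hypothetical names made up for illustration. The idea is that the package imports without pyarrow, and only the Arrow-returning features raise an error when it is missing.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the named module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Resolved once at import time; stays None when pyarrow is absent,
# so importing this module never fails because of pyarrow.
pa = optional_import("pyarrow")


def require_pyarrow(feature: str) -> ModuleType:
    """Return pyarrow, or raise an error naming the feature that needs it."""
    if pa is None:
        raise ImportError(f"{feature} requires pyarrow, which is not installed")
    return pa


def values_to_arrow(values: list):
    """Hypothetical stand-in for a feature that returns Arrow data."""
    arrow = require_pyarrow("values_to_arrow()")
    # pyarrow.table() builds a Table from a dict of column name -> values.
    return arrow.table({"value": values})
```

With this shape, a deployment without pyarrow (for example a size-constrained Lambda package) can still use everything that does not touch Arrow objects, and a caller of an Arrow-returning feature gets an explicit message instead of a failure at import time.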