koalas icon indicating copy to clipboard operation
koalas copied to clipboard

pyarrow interoperability and integration testing

Open wesm opened this issue 4 years ago • 4 comments

We're starting to get some bug reports in Apache Arrow from koalas users

https://issues.apache.org/jira/browse/ARROW-7986

A user commented

koalas in this case relies on pyarrow to convert the pyspark.ml.linalg-specific SparseVector, but arrow does not know what to do with it. The SparseVector comes from "libsvm" format. Here is an example using that format.

We've been working on providing tools to enable third party extensibility as well as informing the Arrow libraries how to handle serialization. There's probably some other things we can do to help in these scenarios (without taking on responsibility in Apache Arrow for handling every third party data structure out there)

Let us know how we can help with this so that we can coordinate and clarify any work needed on each side of the koalas / pyarrow divide

cc @jorisvandenbossche

wesm avatar Mar 04 '20 16:03 wesm

Thanks for escalating the issue here!

It seems related to Spark UDT and Arrow. Actually we can have a workaround to convert from/to pandas DataFrame/Series including Spark UDT objects, but I don't know how it works when manipulating pandas/Koalas. If the user wants to use it in some function taking user functions, like apply, applymap, etc., Koalas doesn't support Spark UDTs since Koalas relies on PySpark's pandas UDFs which doesn't support Spark UDTs yet. (See e.g., SPARK-29952)

If you already have something to inform the Arrow libraries how to handle serialization of third party libraries, we can ask Spark community to support Spark UDT with it.

ueshin avatar Mar 04 '20 20:03 ueshin

At the moment I don't think we have an API to provide for user-defined "unboxing" of array cell types into the corresponding Arrow storage, so we'd need to do a bit of thinking about what should happen. Here is a JIRA issue about the topic

https://issues.apache.org/jira/browse/ARROW-8004

wesm avatar Mar 05 '20 00:03 wesm

Did https://github.com/databricks/koalas/pull/1324 fix the original reported issue from https://issues.apache.org/jira/browse/ARROW-7986 about converting the sparse vector to a koalas Series?

jorisvandenbossche avatar Mar 10 '20 10:03 jorisvandenbossche

Yes, the issue https://issues.apache.org/jira/browse/ARROW-7986 can be closed. #1324 fixed the original issue.

>>> import pandas as pd
>>> import databricks.koalas as ks
>>> from pyspark.ml.linalg import SparseVector
>>>
>>> sparse_values = {0: 0.1, 1: 1.1}
>>> sparse_vector = SparseVector(len(sparse_values), sparse_values)
>>> pds = pd.Series(sparse_vector)
>>> kss = ks.Series(sparse_vector)
>>> kss
0    (0.1, 1.1)
Name: 0, dtype: object

However, we should note that this UDT support itself is supposed to be private at this moment, and Koalas does not currently support nested types.

HyukjinKwon avatar Mar 12 '20 06:03 HyukjinKwon