koalas
pyarrow interoperability and integration testing
We're starting to get some bug reports in Apache Arrow from koalas users
https://issues.apache.org/jira/browse/ARROW-7986
Koalas in this case relies on pyarrow to convert the pyspark.ml.linalg-specific SparseVector, but Arrow does not know what to do with it. The SparseVector comes from the "libsvm" format. Here is an example using that format.
We've been working on providing tools to enable third-party extensibility, as well as ways of informing the Arrow libraries how to handle serialization. There are probably other things we can do to help in these scenarios (without Apache Arrow taking on responsibility for handling every third-party data structure out there).
Let us know how we can help with this, so that we can coordinate and clarify any work needed on each side of the koalas / pyarrow divide.
cc @jorisvandenbossche
Thanks for escalating the issue here!
It seems related to Spark UDT and Arrow.
Actually, we can add a workaround to convert from/to a pandas DataFrame/Series that includes Spark UDT objects, but I don't know how it will behave when manipulating pandas/Koalas objects.
If the user wants to use it in a function that takes user functions, like apply, applymap, etc., Koalas doesn't support Spark UDTs, since Koalas relies on PySpark's pandas UDFs, which don't support Spark UDTs yet. (See, e.g., SPARK-29952.)
If you already have a way to inform the Arrow libraries how to handle serialization for third-party libraries, we can ask the Spark community to support Spark UDTs with it.
At the moment, I don't think we have an API for user-defined "unboxing" of array cell types into the corresponding Arrow storage, so we'd need to do a bit of thinking about what should happen. Here is a JIRA issue about the topic:
https://issues.apache.org/jira/browse/ARROW-8004
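To make the idea of "unboxing" a bit more concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: SparseVec, unbox, and box are hypothetical names, not part of pyarrow, PySpark, or Koalas, and the column layout (size, indices, values) is just one plausible storage shape. It shows a third-party cell object being split into plain columns that a columnar library could store, and then being rebuilt from them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SparseVec:
    """Hypothetical stand-in for a third-party cell type.

    This is NOT pyspark.ml.linalg.SparseVector; it only mimics the idea
    of a vector described by its size plus the non-zero entries.
    """
    size: int
    indices: List[int]
    values: List[float]


def unbox(cells: List[SparseVec]) -> Dict[str, list]:
    """Split each cell into plain fields, stored column by column."""
    return {
        "size": [c.size for c in cells],
        "indices": [list(c.indices) for c in cells],
        "values": [list(c.values) for c in cells],
    }


def box(storage: Dict[str, list]) -> List[SparseVec]:
    """Rebuild the cell objects from the plain columns."""
    return [
        SparseVec(size, indices, values)
        for size, indices, values in zip(
            storage["size"], storage["indices"], storage["values"]
        )
    ]


cells = [SparseVec(2, [0, 1], [0.1, 1.1]), SparseVec(3, [2], [5.0])]
storage = unbox(cells)   # plain ints, floats, and lists only
restored = box(storage)  # round trip back to the cell objects
```

The point of the sketch is that the storage side only ever sees primitive values and lists, so the library holding the data does not need to know anything about the third-party class; only the unbox/box pair does.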
Did https://github.com/databricks/koalas/pull/1324 fix the original reported issue from https://issues.apache.org/jira/browse/ARROW-7986 about converting the sparse vector to a koalas Series?
Yes, the issue https://issues.apache.org/jira/browse/ARROW-7986 can be closed. #1324 fixed the original issue.
>>> import pandas as pd
>>> import databricks.koalas as ks
>>> from pyspark.ml.linalg import SparseVector
>>>
>>> sparse_values = {0: 0.1, 1: 1.1}
>>> sparse_vector = SparseVector(len(sparse_values), sparse_values)
>>> pds = pd.Series(sparse_vector)
>>> kss = ks.Series(sparse_vector)
>>> kss
0    (0.1, 1.1)
Name: 0, dtype: object
However, note that this UDT support is supposed to be private at the moment, and Koalas does not currently support nested types.