Nested data type support
A lot of big data has nested data types, including structs, arrays, and maps. These are specific to big data / Spark and are not found in pandas, so we would need to design the APIs. It's best to write a short design doc on how to access and operate on the aforementioned three types.
Some initial thoughts after discussing with @thunterdb ...
We could reuse Spark's accessors, e.g. `a.b` means struct `a`'s subfield `b`; `a[0]` means the first element of array `a`. `df['struct_field']` should return a DataFrame (so a DataFrame is internally backed by a query plan along with a column expression).
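A minimal PySpark illustration of the accessors mentioned above (the session setup and column names here are just for demonstration):

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# A struct column `a` with subfields `b` and `c`, and an array column `arr`.
df = spark.createDataFrame([Row(a=Row(b=1, c=2), arr=[10, 20])])

# "a.b" reaches into the struct; col("arr")[0] takes the first array element.
df.select(col("a.b"), col("arr")[0]).show()
```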
Note that the precedence should be: if there is a column named `a.b`, then `df['a.b']` should return a Series representing that column. But if there is no column named `a.b`, then `df['a.b']` should return a Series representing struct column `a`'s sub-field `b`.
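A hypothetical sketch of that precedence rule, assuming the koalas DataFrame is backed by a Spark DataFrame `sdf` (the helper name and shape are illustrative, not the actual koalas internals):

```python
def resolve_column(sdf, key):
    """Hypothetical: resolve df[key] against the backing Spark DataFrame `sdf`."""
    if key in sdf.columns:
        # A top-level column literally named e.g. "a.b" takes precedence;
        # backticks tell Spark to treat the dot as part of the name.
        return sdf["`{}`".format(key)]
    # Otherwise fall back to Spark's dotted access into struct column "a".
    return sdf[key]
```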
Oh hm.. should I close #419 for now?
I feel that's still valid though.
Are you folks aware of http://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types?
These are in-pandas support for arbitrary data types, defined internally or externally. They were in fact developed to allow for nested data types. Using `pyarrow` as a backing store, nested types shouldn't be that hard to support.
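For reference, a small example of pyarrow holding nested data directly (struct and list types; nothing koalas-specific):

```python
import pyarrow as pa

# pyarrow models nested data natively: structs, lists (and maps).
structs = pa.array(
    [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
    type=pa.struct([("a", pa.int64()), ("b", pa.string())]),
)
lists = pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))

table = pa.table({"s": structs, "l": lists})
print(table.schema)
```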
Hey @jreback, thanks for the input here :-). This sounds like the user-defined types (UDTs) that PySpark has as well (but those are similarly rather internal and experimental).
`pyarrow` has nested types, yes, but mapping an arbitrary type out of the box would be non-trivial work. For instance, if we convert pandas to `pyarrow` somehow and then read it on the JVM side, what types should it become :D? We would probably have to exchange some metadata for it as well.
@jreback thanks for commenting. Are there any docs or examples on how that works with pyarrow from the users' perspective?
@HyukjinKwon from your comments: https://github.com/databricks/koalas/issues/420#issuecomment-498561529
pandas provides full-fledged Extension Dtypes (used for example by Datetime w/tz, Intervals, Categorical, and nullable Ints), which are simply 1-d array containers. These work very similarly to how pandas holds floats (or ints or whatever), and they are completely abstracted away from the user.
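For example, the built-in extension dtypes behave like any other column from the user's point of view:

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2019-06-01", "2019-06-02"]).tz_localize("UTC"),  # datetime w/ tz
    "cat": pd.Categorical(["a", "b"]),                                      # categorical
    "n": pd.array([1, None], dtype="Int64"),                                # nullable integer
})
print(df.dtypes)
```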
`pyarrow` is merely a way to hold nested data; it is not necessary for the holder of the DataFrame to even introspect / peek at this data (or care about the implementation), except to use 'special' methods, e.g. imagine holding JSON (we have a trivial example here).
There is a project (will be public soon, but not yet) to actually add this in a first-class way using `pyarrow` as the backing store.
Ah, thanks for the details. I look forward to that project becoming public.
One problem we face is that currently the data itself, for instance a `pandas` instance, is purely transferred as `pyarrow` batches and becomes a `pandas` instance again at another node. So, any other information has to be transferred to that node in some other way.
For PySpark's UDTs, that other information, like the implementation class, is held as JSON; inside that, the class definition is pickled, and it is unpickled via `cloudpickle` later on another node.
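A rough sketch of that mechanism (the class and field names are hypothetical; this is just the cloudpickle-in-JSON idea, not PySpark's actual UDT wire format):

```python
import base64
import json

import cloudpickle


class PointUDT:
    """Illustrative user-defined type; stands in for whatever class needs to travel."""


# Sending side: pickle the class definition and embed it in a JSON header that
# travels alongside the Arrow batches.
header = json.dumps({
    "type": "udt",
    "pyClass": base64.b64encode(cloudpickle.dumps(PointUDT)).decode("ascii"),
})

# Receiving node: rebuild the class from the pickled bytes.
restored = cloudpickle.loads(base64.b64decode(json.loads(header)["pyClass"]))
assert restored.__name__ == "PointUDT"
```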
As far as I know, `pyarrow` <> `pandas` conversion isn't supported for Extension Dtypes out of the box in `pyarrow` (correct me if I am wrong). We should probably combine PySpark's UDT approach with `pyarrow` as the backing store so that the types can be used on other nodes everywhere.
`pyarrow` will soon be able to fully round-trip a DataFrame with extension types (and serialize to parquet); see: https://issues.apache.org/jira/browse/ARROW-2428, https://issues.apache.org/jira/browse/ARROW-5271, https://issues.apache.org/jira/browse/ARROW-3829
That would be nice to have! I wonder how the types will be handled when they are round-tripped on non-Python ends (the JVM, for instance, because that's what we do). If that's supported out of the box, yeah, job done.
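For what it's worth, a quick check of the plain nested-data round trip that already works today (exact behavior depends on the pyarrow version; extension-type metadata is the part tracked in the JIRAs above):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Nested Python objects go pandas -> Arrow -> parquet -> pandas, but they come back
# as plain object columns (dicts / lists) rather than a dedicated nested dtype.
pdf = pd.DataFrame({"s": [{"a": 1}, {"a": 2}], "l": [[1, 2], [3]]})
table = pa.Table.from_pandas(pdf)

pq.write_table(table, "/tmp/nested.parquet")
print(pq.read_table("/tmp/nested.parquet").to_pandas())
```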