
Nested data type support

Open rxin opened this issue 5 years ago • 9 comments

A lot of big data has nested data types, including structs, arrays, and maps. These are specific to big data / Spark and are not found in pandas, so we would need to design the APIs. It's best to write a short design doc on how to access and operate on the aforementioned three types.

Some initial thoughts after discussing with @thunterdb ...

We could reuse Spark's accessors, e.g. a.b means struct a's subfield b; a[0] means the first element of array a. df['struct_field'] should return a DataFrame (so a DataFrame is internally backed by a query plan along with a column expression).

Note that the precedence should be if there is a column named a.b, then df['a.b'] should return a Series representing that column. But if there is no column named a.b, then df['a.b'] should return a Series representing struct column a's sub-field b.
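The precedence rule above can be sketched as a small pure-Python resolver (hypothetical helper names; the real implementation would resolve against Spark column expressions and the full schema, not plain dicts):

```python
def resolve(df_columns, nested_schema, key):
    """Resolve df[key] per the proposed precedence: a literal column
    named 'a.b' wins over struct access a.b.

    df_columns: list of top-level column names
    nested_schema: dict mapping struct column name -> set of sub-field names
    """
    if key in df_columns:
        return ("column", key)  # a column literally named 'a.b'
    if "." in key:
        struct, _, field = key.partition(".")
        if struct in df_columns and field in nested_schema.get(struct, set()):
            return ("struct_field", struct, field)  # struct a's sub-field b
    raise KeyError(key)

# With a literal column named 'a.b', df['a.b'] refers to that column:
print(resolve(["a.b", "a"], {"a": {"b"}}, "a.b"))  # ('column', 'a.b')
# Without it, the same key falls back to struct access:
print(resolve(["a"], {"a": {"b"}}, "a.b"))         # ('struct_field', 'a', 'b')
```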

rxin avatar Jun 03 '19 13:06 rxin

Oh hm.. should I close #419 for now?

HyukjinKwon avatar Jun 03 '19 14:06 HyukjinKwon

I feel that's still valid though.

rxin avatar Jun 03 '19 15:06 rxin

are you folks aware of: http://pandas.pydata.org/pandas-docs/stable/development/extending.html#extension-types

this is in-pandas support for arbitrary data types, defined internally or externally. These were in fact developed to allow for nested data types. Using a backing store of pyarrow nested types shouldn't be that hard to support.

jreback avatar Jun 03 '19 17:06 jreback

Hey @jreback. Thanks for the input here :-). Sounds like the user-defined type (UDT) support PySpark has as well (but that is similarly internal and experimental).

pyarrow has nested types, yes, but supporting mapping an arbitrary type out of the box would be non-trivial work. For instance, if we convert pandas to pyarrow somehow and then read it on the JVM side, what types should it be :D ? We should probably also exchange some metadata for it somehow.

HyukjinKwon avatar Jun 04 '19 07:06 HyukjinKwon

@jreback thanks for commenting. Are there any docs or examples of how that works with pyarrow from the user's perspective?

rxin avatar Jun 04 '19 09:06 rxin

@HyukjinKwon from your comments: https://github.com/databricks/koalas/issues/420#issuecomment-498561529

pandas provides full-fledged Extension Dtypes (which are used for, e.g., Datetime w/tz, Intervals, Categorical, and nullable Ints), which are simply 1-d array containers. These work very similarly to how pandas holds floats (or ints, or whatever). They are completely abstracted away from the user.

pyarrow is merely a way to hold nested data, and it is not necessary for the holder of the DataFrame to even introspect / peek at this data (or care about the implementation), except to use 'special' methods; e.g., imagine holding JSON (we have a trivial example here).

There is a project (will be public soon, but not yet) to actually add this in a first class way using pyarrow as the backing store.

jreback avatar Jun 04 '19 19:06 jreback

Ah, thanks for the details. I look forward to that project becoming public.

One problem we face is that currently the data itself, for instance a pandas instance, is transferred purely as pyarrow batches and becomes a pandas instance again on another node. So, any other information has to be transferred to that node in some other way.

For PySpark's UDT, that other information, such as the implementation class, is held as JSON - inside that, the class definition is pickled, and later unpickled via cloudpickle on another node.

As far as I know, pyarrow <> pandas conversion isn't supported for Extension Dtypes out of the box in pyarrow (correct me if I am wrong). Probably we should combine PySpark's UDT approach with pyarrow as the backing store, so that the types can be used on other nodes everywhere.
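The "ship metadata alongside the batches" idea can be sketched roughly like this (purely illustrative: the names are made up, and class source shipped as a string stands in for the cloudpickle-by-value mechanism PySpark's UDT actually uses):

```python
import json

# Hypothetical class definition that must travel with the data, shipped as
# source text here in place of cloudpickle's pickled class definition.
CLASS_SOURCE = """
class PointUDT:
    name = "point"
"""

def pack_metadata():
    # Wrap the type information in a JSON envelope, sent next to the
    # pyarrow batches, similar in spirit to PySpark's UDT JSON metadata.
    return json.dumps({"type": "udt", "pyClassSource": CLASS_SOURCE})

def unpack_metadata(payload):
    # On the receiving node, rebuild the class so the raw arrow data
    # can be re-typed with its original user-defined type.
    meta = json.loads(payload)
    namespace = {}
    exec(meta["pyClassSource"], namespace)
    return namespace["PointUDT"]

restored = unpack_metadata(pack_metadata())
print(restored.name)  # point
```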

HyukjinKwon avatar Jun 05 '19 01:06 HyukjinKwon

pyarrow is soon going to be able to fully replicate a round-trip of a DataFrame with extension types (and serialize to parquet), see: https://issues.apache.org/jira/browse/ARROW-2428, https://issues.apache.org/jira/browse/ARROW-5271, https://issues.apache.org/jira/browse/ARROW-3829

jreback avatar Jun 05 '19 17:06 jreback

That would be nice to have! I wonder how the types will be handled when they are round-tripped through non-Python ends (the JVM, for instance, because that's what we do). If that's supported out of the box, yeah, job done.

HyukjinKwon avatar Jun 05 '19 23:06 HyukjinKwon