
to_pandas options

Open martindurant opened this issue 3 years ago • 7 comments

Description of new feature

It may well be that this is already done, but we are not aware how.

We would like to be able to do the following operations on an awkward array, to support its use as a pandas extension type. These could be done in the context of the existing to_pandas, although they need not concern pandas at all; we can handle that part in the awkward-pandas package.

  • every row of the input becomes nested python structures, giving an object type array; this is already done via tolist()
  • every simple field (leaf, numpy-like) is extracted as a simple array (numpy/series), but the "rest" of the data structure, i.e. the original minus those fields, is returned as awkward
  • the same as above, but the "rest" becomes python objects (I suppose this can be done in two steps combining the previous two bullets)
  • the same again, but with the option that top-level string fields become arrow string arrays

@douglasdavis , am I missing anything?

martindurant avatar Jul 29 '22 01:07 martindurant

Does the second bullet refer to regular fields, i.e. those that can pass through to_numpy?

agoose77 avatar Jul 29 '22 08:07 agoose77

Yes, numpy-able fields (only one dimension deep), plus strings.

martindurant avatar Jul 29 '22 12:07 martindurant

Let's be sure to name the function @douglasdavis is working on as "to_series" as opposed to "to_dataframe", re: #1546. With these two rather different ways of making Pandas objects, we don't want to muddle the issue by calling them both "to_pandas".

jpivarski avatar Jul 29 '22 14:07 jpivarski

@martindurant for (2) what would you like to happen to nested records, e.g.

var * {
    trip: {
        sec: ?float32,
        km: ?float32,
        begin: {
            lon: ?float64,
            lat: ?float64,
            time: ?datetime64[ms]
        },
        end: {
            lon: ?float64,
            lat: ?float64,
            time: ?datetime64[ms]
        },
        path: var * {
            londiff: float32,
            latdiff: float32
        }
    },
    payment: {
        fare: ?float32,
        tips: ?float32,
        total: ?float32,
        type: var * char
    },
    company: var * char
}

My guess is that you might want to flatten the record structure, i.e. produce fields with names such as "trip.begin.lon", etc?
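
To make the flattening idea concrete, here is a plain-Python sketch on a single row represented as nested dicts. The helper flatten_fields and the sample values are hypothetical and chosen only for illustration; this is not how awkward itself flattens records.

```python
def flatten_fields(record, prefix=""):
    """Hypothetical helper: map nested dict keys to dotted column names."""
    flat = {}
    for name, value in record.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            # Recurse into nested records, extending the dotted path.
            flat.update(flatten_fields(value, path))
        else:
            flat[path] = value
    return flat


# Made-up sample row that loosely follows the type shown above.
row = {
    "trip": {"km": 1.5, "begin": {"lon": -87.6, "lat": 41.9}},
    "company": "example cab co",
}
flat = flatten_fields(row)
```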

agoose77 avatar Jul 29 '22 15:07 agoose77

you might want to flatten the record structure

no, we want to allow users to do whatever flattening they might like, but we should not do that by default. We do not want a multiindex.

martindurant avatar Jul 29 '22 15:07 martindurant

Right: that's the difference between this to_series (put a whole Awkward Array into a column, without modification) and ak.to_dataframe (currently named ak.to_pandas, which splits nested records into a MultiIndex of columns and nested lists into a MultiIndex of rows).

jpivarski avatar Jul 29 '22 15:07 jpivarski

Ah, I'd missed the part where you mention this is for an extension type. I'm following :)

agoose77 avatar Jul 29 '22 15:07 agoose77

It looks like you don't want arguments to be added to a to_pandas (or to_series) function; it looks like you need instructions for extracting NumPy arrays (leaves) from an arbitrary array (tree).

I don't see a way (or what it would mean) to remove those leaves from the Awkward Array, but here's how to collect them:

>>> import awkward as ak
>>> import numpy as np

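>>> def action(layout, lateral_context, **extra):
...     # Collect every NumPy leaf; released Awkward 2.x spells this is_numpy
...     # (pre-releases used is_NumpyType).
...     if layout.is_numpy:
...         lateral_context["collect"].append(np.asarray(layout))
... 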

>>> array = ak.Array([
...     [{"x": 1.1, "y": [1]}, {"x": 2.2, "y": [1, 2]}, {"x": 3.3, "y": [1, 2, 3]}],
...     [],
...     [{"x": 4.4, "y": [1, 2, 3, 4]}, {"x": 5.5, "y": [1, 2, 3, 4, 5]}]
... ])

>>> context = {"collect": []}
>>> ak.transform(action, array, lateral_context=context, numpy_to_regular=True)
<Array [[{x: 1.1, y: [1]}, ..., {...}], ...] type='3 * var * {x: float64, y...'>

>>> context["collect"]
[array([1.1, 2.2, 3.3, 4.4, 5.5]), array([1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5])]

ak.transform returned the array it was given, unchanged, but context["collect"] now holds zero-copy references to the NumPy arrays. The numpy_to_regular=True option converts any multidimensional NumPy arrays into regular-list nodes over flattened 1-dimensional arrays before the action sees them, so using that or not is your choice. (Since you don't care about the returned Awkward Array, you could just call np.ravel on the arrays you collect.)

If the array has been manipulated, like this indexed slice,

>>> array = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
>>> array[[-1, 3, 3, 0, 2]]
<Array [[6, 7, 8, 9], [5], [5], [0, 1, 2], [3, 4]] type='5 * var * int64'>

then the interior NumPy arrays might not be what you expect. For instance, the sliced array still contains the original 0 1 2 3 4 5 6 7 8 9, not the rearranged 6 7 8 9 5 5 0 1 2 3 4.

>>> array[[-1, 3, 3, 0, 2]].layout
<ListArray len='5'>
    <starts><Index dtype='int64' len='5'>
        [6 5 5 0 3]
    </Index></starts>
    <stops><Index dtype='int64' len='5'>
        [10  6  6  3  5]
    </Index></stops>
    <content><NumpyArray dtype='int64' len='10'>[0 1 2 3 4 5 6 7 8 9]</NumpyArray></content>
</ListArray>

>>> context = {"collect": []}
>>> ak.transform(action, array[[-1, 3, 3, 0, 2]], lateral_context=context)
<Array [[6, 7, 8, 9], [5], [5], [0, 1, 2], [3, 4]] type='5 * var * int64'>
>>> context["collect"]
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])]

But ak.packed would put it into a canonical form, such that the interior buffers contain what you see.

>>> context = {"collect": []}
>>> ak.transform(action, ak.packed(array[[-1, 3, 3, 0, 2]]), lateral_context=context)
<Array [[6, 7, 8, 9], [5], [5], [0, 1, 2], [3, 4]] type='5 * var * int64'>
>>> context["collect"]
[array([6, 7, 8, 9, 5, 5, 0, 1, 2, 3, 4])]
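
The ListArray layout and the effect of packing can be mimicked in plain Python. This is only a sketch of the idea, not awkward's implementation; to_lists and pack are hypothetical names, and the starts, stops, and content values are copied from the layout printed earlier.

```python
# Plain-Python model of a ListArray: list i is content[starts[i]:stops[i]].
content = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
starts = [6, 5, 5, 0, 3]
stops = [10, 6, 6, 3, 5]


def to_lists(starts, stops, content):
    """Read the lists through the view without touching the buffer."""
    return [content[a:b] for a, b in zip(starts, stops)]


def pack(starts, stops, content):
    """Copy the referenced items, in order, into a new contiguous buffer."""
    packed = []
    offsets = [0]
    for a, b in zip(starts, stops):
        packed.extend(content[a:b])
        offsets.append(len(packed))
    return offsets, packed


lists = to_lists(starts, stops, content)
offsets, packed = pack(starts, stops, content)
```

Here lists matches what the sliced array displays, content is still the original buffer, and packed is the rearranged buffer that the collecting action sees after ak.packed.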

Is that what you needed? If not, let me know and I'll reopen this issue.

jpivarski avatar Nov 10 '22 23:11 jpivarski