[FEA] Get Series.list offsets / Construct Series of lists from offsets and values
Is your feature request related to a problem? Please describe.
I would like to be able to access the offsets of a Series of lists. That would allow me to implement a function like list_add that takes two "awkward arrays," Series of lists of numbers that have the same list shape, and adds them together. The binary operation can be straightforwardly applied to the "leaves" of each list column, which is the child column containing the data. However, to do this, I need a way to access the offsets and rebuild the list structure. For example, if Series.list.offsets and cudf.Series.list.from_arrays(offsets, values) existed, I could run something like:
def list_add(s1, s2):
    """Take two Series of lists of numerical data and add them."""
    # Ignore nested lists for simplicity -- this only works for a single level of lists
    if (s1.list.offsets != s2.list.offsets).any():
        raise ValueError("List columns must have corresponding offsets.")
    return cudf.Series.list.from_arrays(s1.list.offsets, s1.list.leaves + s2.list.leaves)
Describe the solution you'd like
- Implement a property Series.list.offsets that exposes the offset array, similar to PyArrow's pyarrow.ListArray.offsets, but returning a GPU-resident array.
- Implement a constructor Series.list.from_arrays(offsets, values) that builds a Series of lists from input offsets and values, similar to PyArrow's pyarrow.ListArray.from_arrays, but enabling construction from GPU-resident arrays.
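For reference, the CPU-side PyArrow analogue that this proposal mirrors looks roughly like the following; the offsets/values shown in the comments are illustrative.
import pyarrow as pa

arr = pa.array([[1, 2], [3], []])
arr.offsets  # Int32Array: [0, 2, 3, 3]
arr.values   # Int64Array: [1, 2, 3]
# Round-trip: rebuild the same list array from its parts
pa.ListArray.from_arrays(arr.offsets, arr.values)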
Describe alternatives you've considered
I strongly prefer this approach over implementing binops directly on list types because it allows for precise control of what APIs are exposed and how they behave. Implementing binops for lists would allow operators like + to be used, which is prone to error because it overloads the Python-like list semantics of "adding is list concatenation" with the array-like semantics of elementwise addition.
Additional context
It's not clear to me where the name "leaves" came from. To align with PyArrow, we would rename "leaves" to Series.list.values.
FWIW, there is a "Pandas compatible" way to do this today: https://github.com/rapidsai/cudf/issues/10967#issuecomment-1138590222. But I'd agree that a more explicit API would be desirable.
I wouldn't have any objections to adding an .offsets accessor, other than I suppose it leaks some implementation detail (insofar as cuDF following the Arrow format is an "implementation detail").
My 2c here is that the ideal way to do this would be to zero copy to something like a GPU accelerated Awkward Array and back.
@shwina That's very helpful, I did not consider explode()/agg(list). To simplify and match this example:
def list_add(s1, s2):
    return (s1.explode() + s2.explode()).groupby(level=0).agg(list)
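For concreteness, a quick usage sketch of that version (the printed output is illustrative):
import cudf

s1 = cudf.Series([[1, 2], [3, 4, 5]])
s2 = cudf.Series([[10, 20], [30, 40, 50]])
list_add(s1, s2)
# 0        [11, 22]
# 1    [33, 44, 55]
# dtype: list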
I'm guessing that explode() returns a copy, unlike Series.list.leaves, and that groupby(level=0).agg(list) is nontrivial to compute compared to a constructor from offsets and values. Perhaps there would be good reasons for performance and flexibility to expose the offset accessor / list constructor primitives.
As to whether offsets are an implementation detail -- I considered this as well. My view is that offsets are helpful to expose and doing so does not make stronger promises about our data model than what we already make in other ways (offsets are already exposed in the libcudf API, and cuDF has a stated aim to be Arrow-conformant to a large extent).
I definitely agree that the ability to do this computation in a zero-copy way and compatibility with GPU Awkward Arrays would be desirable. Exposing the raw offsets and a way to rebuild a list from them seems like a good step in both of those directions.
It's not clear to me where the name "leaves" came from. To align with PyArrow, we would rename "leaves" to Series.list.values.
Note that values are distinct from leaves:
The values of a list array are what you get by removing "one level of nesting" from the array:
>>> pa.array([[[[1, 2]]]]).values
<pyarrow.lib.ListArray object at 0x7fe4de05df40>
[
  [
    [
      1,
      2
    ]
  ]
]
Whereas what we call leaves is what you get from removing all levels of nesting:
In [7]: cudf.Series([[[[1, 2]]]]).list.leaves
Out[7]:
0    1
1    2
dtype: int64
@shwina Interesting. Would you consider exposing both list.values and list.leaves? It seems important to have a way to un-nest one level at a time (like with list.offsets).
Again, while I'm not opposed to exposing these, I'm much more in favor of higher-level APIs that allow the user not to worry about how lists are actually implemented. For example, if we want to enable binary/unary ops involving list columns, perhaps a better API is something like eval?
df.list.eval("a + b * sin(c)")
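To make the idea concrete, here is a rough sketch of what such an expression could lower to with today's explode/groupby primitives; list_eval_example is just a hypothetical helper, and it assumes the three list columns have matching per-row lengths and no nulls.
import cupy as cp
import cudf

def list_eval_example(df):
    # Explode each list column down to its leaves; the repeated index
    # records which row each leaf came from.
    a, b, c = (df[name].explode() for name in ("a", "b", "c"))
    # Elementwise math on the flat leaves, as CuPy arrays.
    flat = a.values + b.values * cp.sin(c.values.astype("float64"))
    # Regroup the leaves into one list per original row.
    return cudf.Series(flat, index=a.index).groupby(level=0).agg(list)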
I would expect eval's behavior with + to match the + operator's behavior, but we stated in a previous conversation (last week's standup, I think?) that we explicitly do not want to overload operators where array-like operator semantics could conflict with Python list operator semantics (concatenation vs. elementwise addition). I am opposed to making eval act elementwise on lists; I would expect an error there. An explicit function like array_add makes it clearer how the lists are being interpreted.
In any case, I think the right move is to add offsets/values accessors for alignment with libcudf and PyArrow, and debate/implement the action of array-like operators separately.
I think it is important to be able to construct lists from GPU-resident arrays, but that may not be possible without relying on the implementation of offsets/values.
Right, which is why I'm suggesting a distinct DataFrame.list.eval API (note the namespace).
I missed that namespace, thanks for the pointer. I have a lot of questions about how this would act and I don't think the answers are obvious. AST limitations could be harshly constraining here and no broadcasting would be possible. It also introduces an undesirable asymmetry between operators and eval, and is beyond the API scope of both Pandas and Arrow… but so is array_add. Let's table this for a separate discussion. @GregoryKimball might have insight on use cases that would motivate this but I don't think we have an urgent need for new APIs if we implement the accessors / constructor.
I agree - let's move the discussion relating to eval elsewhere.
My broader point though is that we shouldn't require the user to know or care about .values and .offsets in order to do interesting things with lists in cuDF.
Indeed, we could expose values and offsets and, with some effort, much of the existing list functionality could be implemented by users in terms of them.
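As a small illustration of that point: per-row list lengths are just adjacent differences of the offsets. Today the offsets are only reachable via a host round-trip through Arrow; the sketch below assumes no nulls, and the arrays shown in the comments are illustrative.
import cudf

s = cudf.Series([[1, 2], [3], []])
offsets = s.to_arrow().offsets.to_numpy(zero_copy_only=False)  # [0, 2, 3, 3]
lengths = offsets[1:] - offsets[:-1]                           # [2, 1, 0]
# matches the existing accessor:
s.list.len()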
Two questions I would ask are:
- What functionality do we want to ultimately unlock for users by exposing the .offsets and .values of an existing list column? Can/should we implement that functionality ourselves?
- Who/what is producing GPU-resident offsets and values arrays that requires a from_arrays() constructor? Can we have it return a list column instead?
We have been successful so far in totally hiding how strings, for instance, are implemented in cuDF. It'd be nice to do the same for lists.
Hello @shwina and @bdice, bucketize is a feature that we might unlock if we could construct a list column from offsets and values. Bucketize operates on the leaves and reuses the offsets of the input column.
def bucketize(series, buckets):
    ans = cudf.Series([0] * len(series.list.leaves))
    for b in buckets:
        ans += series.list.leaves > b
    return cudf.Series.list.from_arrays(series.list.offsets, ans)
To my surprise the explode trick from #10967 works here as well:
def bucketize(a, buckets):
    a_x = a.explode()
    b = a_x * 0
    for k in buckets:
        b += a_x > k
    return b.groupby(level=0).agg(list)
import cudf
df = cudf.DataFrame({'a': [[1, 2, 3, 3], [1, 2, 1, 0, 1]]})
df['b'] = bucketize(df['a'], [1, 2])
df
                 a                b
0     [1, 2, 3, 3]     [0, 1, 2, 2]
1  [1, 2, 1, 0, 1]  [0, 1, 0, 0, 0]
Then list.bucketize() is an API we may want to consider adding, rather than having each user write their own version of it.
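If such an API were added, one possible implementation on top of today's primitives could look like the sketch below; list_bucketize is a hypothetical name, and it assumes numeric leaves with no nulls and sorted bucket boundaries.
import cupy as cp
import cudf

def list_bucketize(series, buckets):
    flat = series.explode()
    # Bucket id == number of boundaries strictly below each leaf value.
    codes = cp.searchsorted(cp.asarray(buckets), flat.values, side="left")
    # Regroup into one list per original row.
    return cudf.Series(codes, index=flat.index).groupby(level=0).agg(list)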
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.