[FEA] Get Series.list offsets / Construct Series of lists from offsets and values
Is your feature request related to a problem? Please describe.
I would like to be able to access the offsets of a Series of lists. That would allow me to implement a function like list_add that takes two "awkward arrays," Series of lists of numbers that have the same list shape, and adds them together. The binary operation can be straightforwardly applied to the "leaves" of each list column, which is the child column containing the data. However, to do this, I need a way to access the offsets and rebuild the list structure. For example, if Series.list.offsets and cudf.Series.list.from_arrays(offsets, values) existed, I could run something like:
def list_add(s1, s2):
    """Take two Series of lists of numerical data and add them."""
    # Ignore nested lists for simplicity -- this only works for a single level of lists
    if (s1.list.offsets != s2.list.offsets).any():
        raise ValueError("List columns must have corresponding offsets.")
    return cudf.Series.list.from_arrays(s1.list.offsets, s1.list.leaves + s2.list.leaves)
Describe the solution you'd like
- Implement a property Series.list.offsets that exposes the offset array, similar to PyArrow's pyarrow.ListArray.offsets, but returning a GPU-resident array.
- Implement a constructor Series.list.from_arrays(offsets, values) that builds a Series of lists from input offsets and values, similar to PyArrow's pyarrow.ListArray.from_arrays, but enabling construction from GPU-resident arrays.
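For reference, the CPU-side PyArrow analogue that this proposal mirrors looks roughly like the following; the offsets/values shown in the comments are illustrative.
import pyarrow as pa

arr = pa.array([[1, 2], [3], []])
arr.offsets  # Int32Array: [0, 2, 3, 3]
arr.values   # Int64Array: [1, 2, 3]
# Round-trip: rebuild the same list array from its parts
pa.ListArray.from_arrays(arr.offsets, arr.values)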
Describe alternatives you've considered
I strongly prefer this approach over implementing binops directly on list types because it allows for precise control of what APIs are exposed and how they behave. Implementing binops for lists would allow operators like + to be used, which is prone to error because it overloads the Python-like list semantics of "adding is list concatenation" with the array-like semantics of elementwise addition.
Additional context
It's not clear to me where the name "leaves" came from. To align with PyArrow, we would rename "leaves" to Series.list.values.
FWIW, there is a "Pandas compatible" way to do this today: https://github.com/rapidsai/cudf/issues/10967#issuecomment-1138590222. But I'd agree that a more explicit API would be desirable.
I wouldn't have any objections to adding an .offsets accessor, other than I suppose it leaks some implementation detail (insofar as cuDF following the Arrow format is an "implementation detail").
My 2c here is that the ideal way to do this would be to zero copy to something like a GPU accelerated Awkward Array and back.
@shwina That's very helpful, I did not consider explode()/agg(list). To simplify and match this example:
def list_add(s1, s2):
    return (s1.explode() + s2.explode()).groupby(level=0).agg(list)
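For concreteness, a quick usage sketch of that version (the printed output is illustrative):
import cudf

s1 = cudf.Series([[1, 2], [3, 4, 5]])
s2 = cudf.Series([[10, 20], [30, 40, 50]])
list_add(s1, s2)
# 0        [11, 22]
# 1    [33, 44, 55]
# dtype: list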
I'm guessing that explode() returns a copy, unlike Series.list.leaves, and that groupby(level=0).agg(list) is nontrivial to compute compared to a constructor from offsets and values. Perhaps there would be good reasons for performance and flexibility to expose the offset accessor / list constructor primitives.
As to whether offsets are an implementation detail -- I considered this as well. My view is that offsets are helpful to expose and doing so does not make stronger promises about our data model than what we already make in other ways (offsets are already exposed in the libcudf API, and cuDF has a stated aim to be Arrow-conformant to a large extent).
I definitely agree that the ability to do this computation in a zero-copy way and compatibility with GPU Awkward Arrays would be desirable. Exposing the raw offsets and a way to rebuild a list from them seems like a good step in both of those directions.
It's not clear to me where the name "leaves" came from. To align with PyArrow, we would rename "leaves" to Series.list.values.
Note that values are distinct from leaves:
The values of a list array are what you get by removing "one level of nesting" from the array:
>>> pa.array([[[[1, 2]]]]).values
<pyarrow.lib.ListArray object at 0x7fe4de05df40>
[
  [
    [
      1,
      2
    ]
  ]
]
Whereas what we call leaves is what you get from removing all levels of nesting:
In [7]: cudf.Series([[[[1, 2]]]]).list.leaves
Out[7]:
0    1
1    2
dtype: int64
@shwina Interesting. Would you consider exposing both list.values and list.leaves? It seems important to have a way to un-nest one level at a time (like with list.offsets).
Again, while I'm not opposed to exposing these, I'm much more in favor of higher-level APIs that allow the user not to worry about how lists are actually implemented. For example, if we want to enable binary/unary ops involving list columns, perhaps a better API is something like eval?
df.list.eval("a + b * sin(c)")
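To make the idea concrete, here is a rough sketch of what such an expression could lower to with today's explode/groupby primitives; list_eval_example is just a hypothetical helper, and it assumes the three list columns have matching per-row lengths and no nulls.
import cupy as cp
import cudf

def list_eval_example(df):
    # Explode each list column down to its leaves; the repeated index
    # records which row each leaf came from.
    a, b, c = (df[name].explode() for name in ("a", "b", "c"))
    # Elementwise math on the flat leaves, as CuPy arrays.
    flat = a.values + b.values * cp.sin(c.values.astype("float64"))
    # Regroup the leaves into one list per original row.
    return cudf.Series(flat, index=a.index).groupby(level=0).agg(list)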
I would expect eval's behavior with + to match the + operator's behavior, but we stated in a previous conversation (last week's standup, I think?) that we explicitly do not want to overload operators where array-like operator semantics could conflict with Python list operator semantics (concatenation vs. elementwise addition). I am opposed to making eval act elementwise on lists; I would expect an error there. An explicit function like array_add makes it clearer how the lists are being interpreted.
In any case, I think the right move is to add offsets/values accessors for alignment with libcudf and PyArrow, and debate/implement the action of array-like operators separately.
I think it is important to be able to construct lists from GPU-resident arrays, but that may not be possible without relying on the implementation of offsets/values.
Right, which is why I'm suggesting a distinct DataFrame.list.eval API (note the namespace).
I missed that namespace, thanks for the pointer. I have a lot of questions about how this would act and I don't think the answers are obvious. AST limitations could be harshly constraining here and no broadcasting would be possible. It also introduces an undesirable asymmetry between operators and eval, and is beyond the API scope of both Pandas and Arrow… but so is array_add. Let's table this for a separate discussion. @GregoryKimball might have insight on use cases that would motivate this but I don't think we have an urgent need for new APIs if we implement the accessors / constructor.
I agree - let's move the discussion relating to eval elsewhere.
My broader point though is that we shouldn't require the user to know or care about .values and .offsets in order to do interesting things with lists in cuDF.
Indeed, we could expose values and offsets and, with some effort, much of the existing list functionality could be implemented by users in terms of them.
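As a small illustration of that point: per-row list lengths are just adjacent differences of the offsets. Today the offsets are only reachable via a host round-trip through Arrow; the sketch below assumes no nulls, and the arrays shown in the comments are illustrative.
import cudf

s = cudf.Series([[1, 2], [3], []])
offsets = s.to_arrow().offsets.to_numpy(zero_copy_only=False)  # [0, 2, 3, 3]
lengths = offsets[1:] - offsets[:-1]                           # [2, 1, 0]
# matches the existing accessor:
s.list.len()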
Two questions I would ask are:
- What functionality do we want to ultimately unlock for users by exposing the .offsets and .values of an existing list column? Can/should we implement that functionality ourselves?
- Who/what is producing GPU-resident offsets and values arrays that requires a from_arrays() constructor? Can we have it return a list column instead?
We have been successful so far in totally hiding how strings, for instance, are implemented in cuDF. It'd be nice to do the same for lists.
Hello @shwina and @bdice, bucketize is a feature that we might unlock if we could construct a list column from offsets and values. Bucketize operates on the leaves and reuses the offsets of the input column.
def bucketize(series, buckets):
    ans = cudf.Series([0] * len(series.list.leaves))
    for b in buckets:
        ans += series.list.leaves > b
    return cudf.Series.list.from_arrays(series.list.offsets, ans)
To my surprise the explode trick from #10967 works here as well:
def bucketize(a, buckets):
    a_x = a.explode()
    b = a_x * 0
    for k in buckets:
        b += a_x > k
    return b.groupby(level=0).agg(list)
import cudf
df = cudf.DataFrame({'a': [[1, 2, 3, 3], [1, 2, 1, 0, 1]]})
df['b'] = bucketize(df['a'], [1, 2])
df
                 a                b
0     [1, 2, 3, 3]     [0, 1, 2, 2]
1  [1, 2, 1, 0, 1]  [0, 1, 0, 0, 0]
Then list.bucketize() is an API we may want to consider adding, rather than having each user write their own version of it.
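If such an API were added, one possible implementation on top of today's primitives could look like the sketch below; list_bucketize is a hypothetical name, and it assumes numeric leaves with no nulls and sorted bucket boundaries.
import cupy as cp
import cudf

def list_bucketize(series, buckets):
    flat = series.explode()
    # Bucket id == number of boundaries strictly below each leaf value.
    codes = cp.searchsorted(cp.asarray(buckets), flat.values, side="left")
    # Regroup into one list per original row.
    return cudf.Series(codes, index=flat.index).groupby(level=0).agg(list)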
This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.