cudf
[FEA] Support drop_duplicates on Series containing list objects
Hi!
Is your feature request related to a problem? Please describe. It would be useful to be able to drop duplicates over a series which contains list objects.
Describe the solution you'd like To be able to drop duplicates on series containing list objects.
Describe alternatives you've considered We are using the code below, which is very CPU intensive.
# Remove duplicates
compliant_indices["tmp"] = ['_'.join([str(z) for z in y]) for y in [sorted(x) for x in compliant_indices.values.tolist()]]
compliant_indices.drop_duplicates(subset="tmp", inplace=True)
compliant_indices.drop(columns=['tmp'], inplace=True)
Additional context Thanks!!!!
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
I think this feature request is still relevant.
Hey guys! Can you provide an example for your feature request, please? We just have a drop_list_duplicates API on the C++ side and will have a Python binding soon. I'm not sure if this is what you want:
- https://github.com/rapidsai/cudf/pull/7528
- https://github.com/rapidsai/cudf/issues/7414
I personally like @kkraus14's suggestion
Hi,
In the code above, we have a series where the column datatype is a list.
Each row in that series may contain different values, i.e.:
[1, 3, 5, 7] [2, 8, 4] [1, 3, 5, 7]
After applying drop_duplicates on that series, the output should be:
[1, 3, 5, 7] [2, 8, 4]
My only concern is when we have a list that contains the same elements but in a different order. For instance:
[1, 3, 5, 7] [2, 8, 4] [3, 1, 5, 7]
Should the last row be removed? My personal opinion is that, if a sorted() or sort() method that works on device memory is available, I would leave it to the user to explicitly invoke that method before invoking drop_duplicates(). Alternatively, I would add an ignore_sort or similar parameter so that the operation can be performed without first copying the data to host memory before dropping duplicates.
Hope the above helps!
Regards, Miguel
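To make the two possible semantics concrete, here is a minimal host-side sketch in plain Python (not cuDF). The function name and the ignore_order flag are hypothetical and only stand in for the "ignore_sort or similar" parameter suggested above; they are not an existing API.

```python
def drop_duplicate_lists(rows, ignore_order=False):
    """Return rows with duplicate lists removed, keeping first occurrences.

    Hypothetical helper for illustration only; `ignore_order` is an
    assumed parameter name, not part of cuDF.
    """
    seen = set()
    kept = []
    for row in rows:
        # Tuples are hashable; sorting first makes the key order-insensitive.
        key = tuple(sorted(row)) if ignore_order else tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [[1, 3, 5, 7], [2, 8, 4], [3, 1, 5, 7]]

# Order-sensitive: [3, 1, 5, 7] is treated as a distinct row and is kept.
order_sensitive = drop_duplicate_lists(rows)

# Order-insensitive: [3, 1, 5, 7] matches [1, 3, 5, 7] and is dropped.
order_insensitive = drop_duplicate_lists(rows, ignore_order=True)
```

With the order-sensitive default, sorting each list beforehand gives the same result as the flag, which is the "explicitly invoke sort first" option.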
This is an entirely different request than #7414.
This issue is asking for removing duplicate elements in a column where the column is list dtype so the elements are lists. i.e.
input = [
[1, 1, 2],
[3, 4, 4],
[1, 1, 2]
]
output = [
[1, 1, 2],
[3, 4, 4]
]
#7414 is about removing duplicate values from within list elements. i.e.
input = [
[1, 1, 2],
[3, 4, 4],
[1, 1, 2]
]
output = [
[1, 2],
[3, 4],
[1, 2]
]
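The difference between the two requests can be sketched in plain Python (host-side only, not the libcudf implementation). Both function names are made up for this illustration, and the element order that libcudf returns inside each list may differ from what is shown here.

```python
def drop_duplicate_rows(column):
    """This issue: remove rows whose whole list equals an earlier row."""
    seen = set()
    out = []
    for row in column:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def drop_duplicates_within_rows(column):
    """#7414: remove repeated values inside each list, keeping every row."""
    # dict.fromkeys keeps the first occurrence of each value, in order.
    return [list(dict.fromkeys(row)) for row in column]


column = [[1, 1, 2], [3, 4, 4], [1, 1, 2]]
rows_deduped = drop_duplicate_rows(column)            # fewer rows
elements_deduped = drop_duplicates_within_rows(column)  # same row count
```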
Okay, I see. Removing duplicate lists requires a list comparison operator. That operator is also useful for list sorting, as requested here: https://github.com/rapidsai/cudf/issues/5890. So, once list comparison is supported, we can address both issues at the same time.
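As a rough illustration of what such a comparison operator has to do, here is a plain-Python sketch of a lexicographic three-way comparison between two lists. It is an assumption-laden toy: it ignores nulls and nested lists, and it is not how libcudf implements its row comparator.

```python
from functools import cmp_to_key


def compare_lists(a, b):
    """Three-way lexicographic comparison: negative, zero, or positive."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared positions are equal, so the shorter list orders first.
    return (len(a) > len(b)) - (len(a) < len(b))


rows = [[3, 4, 4], [1, 1, 2], [1, 1], [1, 1, 2]]

# The same operator serves both use cases: sorting puts equal lists next
# to each other, and a result of 0 identifies duplicates.
sorted_rows = sorted(rows, key=cmp_to_key(compare_lists))
```

After sorting, the two [1, 1, 2] rows are adjacent, so sorting and duplicate detection can share one comparator.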
Hi,
Congratulations on the recent announcement of lists support on cuDF.
I hope this feature request can be considered for the next release (0.20).
Thanks!
@miguelusque it likely will not be. Operations that require comparing lists require changes to our row comparator and have the potential to slow down all of libcudf if not done carefully. We're trying to figure out a path forward here but it's likely going to take some time.
@miguelusque a potential workaround (once https://github.com/rapidsai/cudf/pull/7929 lands) would be concatting list elements w/ a token into a string and doing drop duplicates against the string column.
Hi @randerzander , I think that your proposal is similar to the workaround we are using currently, detailed at the top of this issue .
That would imply first sorting the elements in each list, then concatenating them, and then dropping the duplicates, right?
We wanted to avoid having to move the data to host memory, which is the reason for this feature request (we need to sort the data in host memory, if I am not mistaken).
Please, let me know if I have misunderstood your workaround.
Thanks!
Sorting within lists is supported on GPU as well, so I think you'll be good once list->string concat is merged.
Thank you! I was not aware of it! :-)
@miguelusque - a related issue to watch for the workaround
Hi, it looks like the following workaround works 100% in GPU memory. Could someone please confirm it? Thanks!
import cudf
df = cudf.DataFrame({"a": [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
b = df["a"].list.sort_values().list.astype(str).str.join("-").drop_duplicates()
b
0 1-3-5-7
1 2-4-8
Name: a, dtype: object
If the lists within the series already contain string elements, it might be more straightforward:
import cudf
df = cudf.DataFrame({"a": [["1", "3", "5", "7"], ["2", "8", "4"], ["1", "3", "5", "7"]]})
b = df["a"].list.sort_values().str.join("-").drop_duplicates()
b
0 1-3-5-7
1 2-4-8
Name: a, dtype: object
Considering that we already have all the pieces available in cuDF, it might be well worth adding list support to the drop_duplicates method.
Thanks!
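One thing to watch with the string-joining workaround above when the elements are strings: two different lists can produce the same joined key if an element already contains the separator. The plain-Python sketch below shows the collision on the host; join_key and tuple_key are hypothetical helpers, and the tuple key is only there to show that the original lists really differ.

```python
def join_key(row, sep="-"):
    """Key used by the workaround: sort the elements, then join them."""
    return sep.join(sorted(row))


def tuple_key(row):
    """Unambiguous key for comparison: the sorted elements themselves."""
    return tuple(sorted(row))


a = ["1", "3"]
b = ["1-3"]  # a single element that happens to contain the separator

# Both rows join to "1-3", so drop_duplicates on the joined strings
# would treat them as the same row even though the lists differ.
joined_collides = join_key(a) == join_key(b)
tuples_differ = tuple_key(a) != tuple_key(b)
```

Choosing a separator that cannot occur in the data avoids this.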
With the addition of list column support for distinct in libcudf (#10641), this issue just needs Python bindings.
In 22.06, drop_duplicates uses a sort-based algorithm and relies on the lexicographic comparator. We expect this will be closed by #11129.
import cudf
df = cudf.DataFrame({"a": [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
df['a'].drop_duplicates()
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/table/row_operators.cu:267: Cannot lexicographic compare a table with a LIST column
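For context on why the comparator matters here, a sort-based drop_duplicates can be sketched in plain Python as below. This is only a host-side illustration of the general technique, not the cuDF or libcudf code; it leans on Python's built-in lexicographic list comparison, which is exactly the piece the error says is missing for LIST columns.

```python
def sort_based_drop_duplicates(rows):
    """Return (original_index, row) pairs for the first of each distinct row."""
    # Argsort the row positions by value. Python compares lists
    # lexicographically, and sorted() is stable, so among equal rows the
    # one with the smallest original index comes first.
    order = sorted(range(len(rows)), key=lambda i: rows[i])

    kept = []
    for pos, i in enumerate(order):
        # A row is kept only if it differs from the row just before it
        # in sorted order; equal rows are adjacent after sorting.
        if pos == 0 or rows[i] != rows[order[pos - 1]]:
            kept.append(i)
    return [(i, rows[i]) for i in kept]


rows = [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]
result = sort_based_drop_duplicates(rows)
```

Both the sort and the adjacent-row equality check need to compare whole lists, so neither step can run without a list-aware comparator.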
In 22.10 we are closer, but there is still a lingering issue. The list column is returned sorted, but the duplicates remain.
>>> df = cudf.DataFrame({"a": [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
>>> df['a'].drop_duplicates()
0 [1, 3, 5, 7]
2 [1, 3, 5, 7]
1 [2, 8, 4]
Name: a, dtype: list
May be fixed by:
- https://github.com/rapidsai/cudf/issues/11089,
which is implemented in either:
- https://github.com/rapidsai/cudf/pull/11230, or
- https://github.com/rapidsai/cudf/pull/11656
This feature is now available in 23.06.
>>> df = cudf.DataFrame({"a": [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
>>> df['a'].drop_duplicates()
0 [1, 3, 5, 7]
1 [2, 8, 4]
Name: a, dtype: list
@GregoryKimball Can we open a follow-up issue to add explicit Python tests for this? I would have done so in #11656 if I’d realized it could close this issue. Thanks!