[FEA] Support drop_duplicates on Series containing list objects

Open miguelusque opened this issue 4 years ago • 17 comments

Hi!

Is your feature request related to a problem? Please describe.
It would be useful to be able to drop duplicates over a series which contains list objects.

Describe the solution you'd like
To be able to drop duplicates on series containing list objects.

Describe alternatives you've considered
We are using the code below, which is very CPU intensive.

# Remove duplicates 
compliant_indices["tmp"] = ['_'.join([str(z) for z in y]) for y in [sorted(x) for x in compliant_indices.values.tolist()]]
compliant_indices.drop_duplicates(subset="tmp", inplace=True)
compliant_indices.drop(columns=['tmp'], inplace=True)

Additional context
Thanks!!!!

miguelusque avatar Nov 17 '20 01:11 miguelusque

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Feb 16 '21 20:02 github-actions[bot]

I think this feature request is still relevant.

miguelusque avatar Feb 20 '21 11:02 miguelusque

Hey guys! Can you provide an example for your feature request, please? We just have a drop_list_duplicates API on the C++ side and will have a Python binding soon. I'm not sure if this is what you want:

  • https://github.com/rapidsai/cudf/pull/7528
  • https://github.com/rapidsai/cudf/issues/7414

ttnghia avatar Mar 12 '21 16:03 ttnghia

I personally like @kkraus14's suggestion

randerzander avatar Mar 12 '21 17:03 randerzander

Hi,

In the code above, we have a series where the column datatype is a list.

Each row in that series may contain different values, i.e.:

[1, 3, 5, 7]
[2, 8, 4]
[1, 3, 5, 7]

After applying drop_duplicates on that series, the output should be:

[1, 3, 5, 7]
[2, 8, 4]

My only concern is when we have a list that contains the same number of elements but in a different order. For instance:

[1, 3, 5, 7]
[2, 8, 4]
[3, 1, 5, 7]

Should the last row be removed? My personal opinion is that, if a sorted() or sort() method that works on device memory is available, I would leave it to the user to invoke that method explicitly before calling drop_duplicates(). Alternatively, I would add an ignore_sort or similar parameter so that the operation can be performed without first copying the data to host memory.
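
To make the two possible behaviours concrete, here is a minimal host-side sketch in plain Python. It only illustrates the semantics in question and is not how cuDF implements anything; the function name and the `ignore_order` flag are hypothetical stand-ins for the "ignore_sort or similar" parameter.

```python
def drop_duplicate_lists(rows, ignore_order=False):
    """Keep the first occurrence of each list.

    `ignore_order` is a hypothetical flag used only for this illustration.
    """
    seen = set()
    kept = []
    for row in rows:
        # A tuple is hashable; sorting first makes the key order-insensitive.
        key = tuple(sorted(row)) if ignore_order else tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [[1, 3, 5, 7], [2, 8, 4], [3, 1, 5, 7]]
print(drop_duplicate_lists(rows))                     # order matters: all three rows are kept
print(drop_duplicate_lists(rows, ignore_order=True))  # order ignored: the last row is dropped
```

With an explicit sort step available, the second call is equivalent to sorting each list first and then using the default, order-sensitive comparison.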

Hope the above helps!

Regards, Miguel

miguelusque avatar Mar 12 '21 19:03 miguelusque

This is an entirely different request than #7414.

This issue is asking for removing duplicate elements in a column where the column is list dtype so the elements are lists. i.e.

input = [
    [1, 1, 2],
    [3, 4, 4],
    [1, 1, 2]
]

output = [
    [1, 1, 2],
    [3, 4, 4]
]

#7414 is about removing duplicate values from within list elements. i.e.

input = [
    [1, 1, 2],
    [3, 4, 4],
    [1, 1, 2]
]

output = [
    [1, 2],
    [3, 4],
    [1, 2]
]
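
The difference between the two requests can be sketched in plain Python (host-side only, with hypothetical helper names; this is not the libcudf or cuDF API):

```python
def drop_duplicate_rows(rows):
    # This issue: whole lists are the unit of comparison.
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return [list(t) for t in dict.fromkeys(tuple(r) for r in rows)]


def drop_duplicates_within_rows(rows):
    # #7414: values are deduplicated inside each list; the row count is unchanged.
    return [list(dict.fromkeys(r)) for r in rows]


data = [[1, 1, 2], [3, 4, 4], [1, 1, 2]]
print(drop_duplicate_rows(data))          # [[1, 1, 2], [3, 4, 4]]
print(drop_duplicates_within_rows(data))  # [[1, 2], [3, 4], [1, 2]]
```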

kkraus14 avatar Mar 12 '21 19:03 kkraus14

Okay, I see. Removing duplicate lists requires a list comparison operator. That operator is also useful for list sorting, as requested here: https://github.com/rapidsai/cudf/issues/5890. So, once list comparison is supported, we can address both issues at the same time.

ttnghia avatar Mar 14 '21 15:03 ttnghia

Hi,

Congratulations on the recent announcement of lists support on cuDF.

I hope this feature request could be considered for next release (0.20).

Thanks!

miguelusque avatar Apr 20 '21 10:04 miguelusque

@miguelusque it likely will not be. Operations that require comparing lists require changes to our row comparator and have the potential to slow down all of libcudf if not done carefully. We're trying to figure out a path forward here but it's likely going to take some time.

kkraus14 avatar Apr 21 '21 21:04 kkraus14

@miguelusque a potential workaround (once https://github.com/rapidsai/cudf/pull/7929 lands) would be concatenating list elements with a token into a string and running drop duplicates against the string column.
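
One thing to keep in mind with a join-based key: if the token can also occur inside an element, two different lists can collapse to the same string. A small plain-Python illustration (the `join_key` helper is hypothetical, and the unit-separator character is just one possible choice of token):

```python
def join_key(row, sep):
    # Build a string key for a list by joining its elements with `sep`.
    return sep.join(str(x) for x in row)


a = ["a-b", "c"]
b = ["a", "b-c"]

# Distinct lists, but "-" also appears inside the elements, so the keys collide.
print(join_key(a, "-") == join_key(b, "-"))        # True
# A token that never occurs in the data keeps the keys distinct.
print(join_key(a, "\x1f") == join_key(b, "\x1f"))  # False
```

For lists of non-negative integers joined with "-", as in the examples in this thread, this particular collision does not arise.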

randerzander avatar Apr 21 '21 22:04 randerzander

Hi @randerzander, I think your proposal is similar to the workaround we are currently using, detailed at the top of this issue.

That would mean first sorting the elements in each list, then concatenating them, and then dropping the duplicates, right?

We wanted to avoid moving the data to host memory, which is the reason for this feature request (we need to sort the data in host memory, if I am not mistaken).

Please, let me know if I have misunderstood your workaround.

Thanks!

miguelusque avatar Apr 21 '21 22:04 miguelusque

Sorting within lists is supported on GPU as well, so I think you'll be good once list->string concat is merged.

randerzander avatar Apr 21 '21 22:04 randerzander

Thank you! I was not aware of it! :-)

miguelusque avatar Apr 21 '21 22:04 miguelusque

@miguelusque - a related issue to watch for the workaround

randerzander avatar Apr 27 '21 14:04 randerzander

Hi, it looks like the following workaround works 100% in GPU memory. Could someone please confirm it? Thanks!

import cudf

df = cudf.DataFrame({"a": [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
b = df["a"].list.sort_values().list.astype(str).str.join("-").drop_duplicates()
b
0    1-3-5-7
1      2-4-8
Name: a, dtype: object

If the lists within the series already contain string elements, it might be more straightforward:

import cudf
df = cudf.DataFrame({"a": [["1", "3", "5", "7"], ["2", "8", "4"], ["1", "3", "5", "7"]]})

b = df["a"].list.sort_values().str.join("-").drop_duplicates()
b
0    1-3-5-7
1      2-4-8
Name: a, dtype: object

Considering that all the pieces are already available in cuDF, it might be well worth adding list support to the drop_duplicates method.

Thanks!

miguelusque avatar Jun 25 '22 18:06 miguelusque

With the addition of list column support for distinct in libcudf (#10641), this issue just needs Python bindings.

GregoryKimball avatar Jun 30 '22 22:06 GregoryKimball

In 22.06 drop_duplicates uses a sort-based algorithm and relies on the lexicographic comparator. We expect this will be closed by #11129

import cudf
df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
df['a'].drop_duplicates()
RuntimeError: cuDF failure at: /workspace/.conda-bld/work/cpp/src/table/row_operators.cu:267: Cannot lexicographic compare a table with a LIST column
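
For readers unfamiliar with the sort-based approach, here is a rough plain-Python sketch of the idea: sort the row indices with a lexicographic comparison, then keep only the first row of each run of equal neighbours. Python's built-in list comparison stands in for the lexicographic row comparator; the function name is hypothetical and this is not the libcudf implementation.

```python
def sort_based_drop_duplicates(rows):
    # Python compares lists lexicographically, element by element.
    order = sorted(range(len(rows)), key=lambda i: rows[i])
    keep = []
    for pos, i in enumerate(order):
        # sorted() is stable, so the first index in each run of equal rows
        # is the earliest occurrence in the original data.
        if pos == 0 or rows[i] != rows[order[pos - 1]]:
            keep.append(i)
    # Report the surviving rows in their original order.
    return [rows[i] for i in sorted(keep)]


rows = [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]
print(sort_based_drop_duplicates(rows))  # [[1, 3, 5, 7], [2, 8, 4]]
```

The error above arises at the comparison step: without a comparator that can order LIST rows, the sort cannot run.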

GregoryKimball avatar Aug 01 '22 01:08 GregoryKimball

In 22.10 we are closer, but there is a lingering issue: the list column is returned sorted, but the duplicates remain.

>>> df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
>>> df['a'].drop_duplicates()
0    [1, 3, 5, 7]
2    [1, 3, 5, 7]
1       [2, 8, 4]
Name: a, dtype: list

GregoryKimball avatar Oct 05 '22 02:10 GregoryKimball

May be fixed by:

  • https://github.com/rapidsai/cudf/issues/11089,

which is implemented in either:

  • https://github.com/rapidsai/cudf/pull/11230, or
  • https://github.com/rapidsai/cudf/pull/11656

ttnghia avatar Apr 06 '23 17:04 ttnghia

This feature is now available in 23.06.

>>> df = cudf.DataFrame({"a":  [[1, 3, 5, 7], [2, 8, 4], [1, 3, 5, 7]]})
>>> df['a'].drop_duplicates()
0    [1, 3, 5, 7]
1       [2, 8, 4]
Name: a, dtype: list

GregoryKimball avatar May 31 '23 04:05 GregoryKimball

@GregoryKimball Can we open a follow-up issue to add explicit Python tests for this? I would have done so in #11656 if I’d realized it could close this issue. Thanks!

bdice avatar May 31 '23 04:05 bdice