
Feature request: `DataChain.unique`

Open tibor-mach opened this issue 1 year ago • 2 comments

Description

To avoid having to use pandas in situations like this (during model training), we need a `unique` method on `DataChain` (and potentially `nunique` as well).

tibor-mach avatar Oct 14 '24 09:10 tibor-mach

@tibor-mach could you please update the link in the description?

(and probably put a simple code sample here as well, to give some sense of the use case and so that it does not expire and stays publicly visible)

shcheklein avatar Oct 14 '24 17:10 shcheklein

Isn't this the same as distinct?

From the unique docs:

>>> pd.unique(pd.Series([2, 1, 3, 3]))
array([2, 1, 3])

This passes:

from datachain import DataChain


def test_distinct(test_session):
    dc = DataChain.from_values(
        val=[2, 1, 3, 3], other_val=["a", "b", "c", "d"], session=test_session
    )

    assert sorted(dc.collect("val")) == [1, 2, 3, 3]
    assert sorted(dc.distinct("val").collect("val")) == [1, 2, 3]

mattseddon avatar Oct 15 '24 00:10 mattseddon
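
For readers comparing the two behaviours, the semantics discussed above can be sketched in plain Python, without pandas or DataChain. This is only an illustrative sketch: the helper names `unique_values` and `nunique` are hypothetical and belong to neither library. `unique_values` mirrors the `pd.unique` output quoted above, which keeps values in order of first appearance, and `nunique` returns only the count of distinct values.

```python
from typing import Hashable, Iterable, List


def unique_values(values: Iterable[Hashable]) -> List[Hashable]:
    """Return the distinct values in order of first appearance (hypothetical helper)."""
    # dict keys keep insertion order (Python 3.7+), so duplicates are dropped
    # while the first occurrence of each value keeps its position.
    return list(dict.fromkeys(values))


def nunique(values: Iterable[Hashable]) -> int:
    """Return the number of distinct values (hypothetical helper)."""
    return len(set(values))


vals = [2, 1, 3, 3]
print(unique_values(vals))  # [2, 1, 3]
print(nunique(vals))        # 3
```

The test above sorts the collected values before comparing, so it checks only which values `distinct` returns and makes no claim about their order.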

@mattseddon You're absolutely right. Closing this, since it is already implemented as `distinct`.

tibor-mach avatar Oct 15 '24 08:10 tibor-mach