datachain
Feature request: `DataChain.unique`
Description
To avoid having to use pandas in situations like this (during model training), we need a `unique` method on `DataChain` (and potentially `nunique` as well).
@tibor-mach could you please update the link in the description?
(And probably add a simple code sample here as well, to give some sense of the use case and so that it does not expire and is publicly visible.)
Isn't this the same as distinct?
From the unique docs:
>>> pd.unique(pd.Series([2, 1, 3, 3]))
array([2, 1, 3])
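For anyone skimming, here is a minimal plain-Python sketch of the semantics being asked for. The helper names `unique` and `nunique` are hypothetical and only mirror the pandas names; they are not part of the DataChain API. The sketch assumes the values are hashable.

```python
from typing import Hashable, Iterable, List, TypeVar

T = TypeVar("T", bound=Hashable)


def unique(values: Iterable[T]) -> List[T]:
    """Return the distinct values in order of first appearance (hypothetical helper)."""
    # dicts preserve insertion order, so fromkeys drops repeats
    # while keeping the first occurrence of each value.
    return list(dict.fromkeys(values))


def nunique(values: Iterable[T]) -> int:
    """Return the number of distinct values (hypothetical helper)."""
    return len(set(values))


vals = [2, 1, 3, 3]
print(unique(vals))   # [2, 1, 3]
print(nunique(vals))  # 3
```

This keeps first-seen order, which matches the pandas output quoted above.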
This passes:
def test_distinct(test_session):
    dc = DataChain.from_values(
        val=[2, 1, 3, 3], other_val=["a", "b", "c", "d"], session=test_session
    )
    assert sorted(dc.collect("val")) == [1, 2, 3, 3]
    assert sorted(dc.distinct("val").collect("val")) == [1, 2, 3]
@mattseddon You're absolutely right. Closing this, since it is already implemented as `distinct`.