
Efficient kernel implementation for `drop_duplicates` and `sort`

Open wenleix opened this issue 3 years ago • 0 comments

For a single column, delegating to an Arrow Array seems like a good way to provide initial support. This is similar to https://github.com/facebookresearch/torcharrow/issues/64 and https://github.com/facebookresearch/torcharrow/issues/53

Arrow arrays support `unique`: https://arrow.apache.org/docs/python/generated/pyarrow.compute.unique.html#pyarrow.compute.unique

>>> import pyarrow as pa

>>> a = pa.array([1, 2, 3, 2])
>>> a.unique()
<pyarrow.lib.Int64Array object at 0x7f89a065ed00>
[
  1,
  2,
  3
]

For sort, it looks like we first need to get the sort indices (https://arrow.apache.org/docs/python/api/compute.html#sorts-and-partitions) and then reorder the elements with the array selection methods (https://arrow.apache.org/docs/python/api/compute.html#selections):

>>> import pyarrow as pa
>>> import pyarrow.compute as pac

>>> a = pa.array([1, 5, 7, 3, 2])
>>> pac.take(a, pac.array_sort_indices(a))
<pyarrow.lib.Int64Array object at 0x7f89a065edc0>
[
  1,
  2,
  3,
  5,
  7
]

wenleix, Nov 25 '21 04:11