
uniques and distinct_counts in ak.str.*

Open • jpivarski opened this issue 11 months ago • 1 comment

Description of new feature

Although pyarrow.compute.unique and pyarrow.compute.value_counts work for many data types, we could use them on strings only in the ak.str.* namespace.

Why not use them in general? Outside of the ak.str.* (and possibly ak.dt.*) namespace, it would be surprising to encounter a function that does not work because pyarrow is not installed. Also, I don't know how these functions would define equality for lists and records, with or without missing values. We'd want to know what semantics we're imposing.

We can already implement uniqueness and unique counts of primitive types with sorting and ak.run_lengths, so that wouldn't be a new ability. Doing uniqueness-counting on strings is an especially useful case; it would be a positive asset to add even that one case.

I must have overlooked these when scanning through the lists of string functions; they're categorized differently. Are there any other functions that we could use in a string-only context in ak.str.*?

jpivarski • Sep 08 '23 16:09

I agree. There's already precedent for this with to_categorical, which could be used for non-strings but is subject to the same constraints.

agoose77 • Sep 08 '23 19:09