DataFrames.jl
Provide `nunique` for Integers in `describe`
Related discussion in https://github.com/JuliaData/DataFrames.jl/issues/2384
At least for the subset of Reals where counting unique values poses no real problem (e.g., integers, in contrast to floats), shouldn't this be reported when the user asks for it?
In abstractdataframes.jl:

    if :nunique in stats
        if eltype(col) <: Real
            d[:nunique] = nothing
        else
            d[:nunique] = try length(Set(col)) catch end
        end
    end
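To make the effect of that branch concrete, here is a minimal standalone sketch of the same logic using only Base. The function name `nunique_or_nothing` is hypothetical and exists only for illustration; it is not part of DataFrames.jl.

```julia
# Hypothetical helper mirroring the quoted snippet; not the library's API.
function nunique_or_nothing(col)
    if eltype(col) <: Real
        return nothing               # numeric columns are skipped entirely
    else
        return try
            length(Set(col))         # count distinct values via hashing
        catch
            nothing                  # fall back to nothing if counting throws
        end
    end
end

r_int = nunique_or_nothing([1, 2, 2, 3])         # Int <: Real, so skipped
r_flt = nunique_or_nothing([1.0, 2.5, 2.5])      # Float64 <: Real, so skipped
r_str = nunique_or_nothing(["a", "b", "a"])      # counted: 2 distinct strings
```

With this logic, integer and float columns are treated identically, which is the behaviour this issue asks to reconsider.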
We could add `:nunique!`, which would be the same as `:nunique` but always calculated (this is dangerous, which the `!` would signal, and the user would have to choose it consciously).
This is a solution I see to avoid a breaking change. @nalimilan - what do you think?
This and https://github.com/JuliaData/DataFrames.jl/issues/3095 should be easy changes if we want them, so we could put them in the 1.4 release?
Unfortunately, integers have the same problem as floats: if you have many unique values it's super slow. An exception is small integer types (up to `Int16`), but these are not the most common case anyway.
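One way to see why small integer types are the exception: their value range is bounded (at most 65,536 values for 16-bit types), so distinct values can be counted in a single pass with a fixed-size bitmap instead of a growing hash set. The sketch below is only an illustration of that idea, with a hypothetical function name; it is not how DataFrames.jl implements anything.

```julia
# Hypothetical illustration: count distinct values of a small integer type
# using a bitmap that covers the type's whole value range.
const SmallInt = Union{Int8, UInt8, Int16, UInt16}

function nunique_small(col::AbstractVector{T}) where {T<:SmallInt}
    lo = Int(typemin(T))
    seen = falses(Int(typemax(T)) - lo + 1)   # one bit per possible value
    n = 0
    for x in col
        idx = Int(x) - lo + 1                 # shift value to a 1-based index
        if !seen[idx]
            seen[idx] = true
            n += 1
        end
    end
    return n
end

n_small = nunique_small(Int16[1, 2, 2, -5, 1])   # distinct values: 1, 2, -5
```

For `Int32` or `Int64` the same bitmap would need billions of entries or more, which is why the approach does not carry over to the common integer types.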
Adding a new name would be OK though. I'm not sure `:nunique!` is really appropriate, as `!` generally indicates an in-place operation, but any name could do (e.g. `:nuniqueall`).
@eirikbrandsaas What's your use case? I assume you have few unique values in your integer columns?
`:nuniqueall` is fine with me.
@nalimilan In this case I had data with about 500 rows where I knew ~99% of row values would be the same for some columns. But I wanted to check, and then `describe(df, :nunique)` sounded pretty natural.
More generally, I find this useful when working with messy data. At least I often use the info Stata provides when you "codebook" a variable, including number of unique values (say identification numbers in a panel data set) when working with messy survey data.
More philosophically, I don't see the problem with slow calculation. If somebody asks for the number of unique elements, I don't see why the software shouldn't give it to them just because it's slow. Any statistical operation is slow in the limit.
That said, as evident from this amazing package, you know what you're doing!
Sorry to bump it again, but three more comments:
- It doesn't work for Boolean either:
    df_tmp = DataFrame(boolean = Bool.(round.(Int, rand(20))))
    describe(df_tmp, :nunique)

    1×2 DataFrame
     Row │ variable  nunique
         │ Symbol    Nothing
    ─────┼───────────────────
       1 │ boolean
- The more I think about it, the less I get the argument that this operation can be costly. I can do `groupby(df, :x)` when `:x` is a `Float`, which is also costly. In general, it's not clear why that is "your business", so to speak, especially as long as `nunique` isn't part of the default `describe(df)` call. I don't know anything about sorting algorithms, but is it really that much cheaper to find unique elements of a string?
    describe(DataFrame(boolean = Bool.(round.(Int, rand(20))), str = string.(rand(20))), :nunique)

    2×2 DataFrame
     Row │ variable  nunique
         │ Symbol    Union…
    ─────┼───────────────────
       1 │ boolean
       2 │ str       20
- A third use case is that papers sometimes report the number of unique elements of each variable in their summary statistics. I can get all the other statistics except this one through `describe()`.
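The Boolean result follows from Julia's type hierarchy: `Bool` is a subtype of `Real`, so it hits the same `eltype(col) <: Real` check as other numeric columns. Until `describe` offers this, the count can be computed by hand. The sketch below uses only Base, with a NamedTuple of vectors as a hypothetical stand-in for a table; with DataFrames.jl the same function could presumably be applied to each column, e.g. via `eachcol(df)`.

```julia
# Bool is numeric in Julia's type hierarchy, which is why it gets `nothing`.
bool_is_real = Bool <: Real

# Hypothetical stand-in for a table: a NamedTuple of column vectors.
cols = (boolean = [true, false, true, true],
        str     = ["a", "b", "c", "a"])

# Count distinct values per column, regardless of element type.
# `map` over a NamedTuple keeps the column names.
ndistinct = map(col -> length(Set(col)), cols)
```

Here `ndistinct.boolean` and `ndistinct.str` hold the per-column counts, which can then be reported alongside the other summary statistics.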
- Counting unique values for strings is more expensive than for floats.
- The original reasoning was that in very wide tables it is very rare that they have many string columns, rather they have many float columns - and this is a problematic case.
- As you can see in this issue's metadata, this feature has been added for the 1.4 release, so it will be added soon. The current plan is to call what you ask for `:nuniqueall` to avoid a breaking change.
Perfect, thanks!
FWIW, `Base` (unfortunately, imho) uses the word `unique`, but in general I think the right word for this concept is not `unique` but `distinct`.