DataFrames.jl Provide `nunique` for Integers in `describe`

Open eirikbrandsaas opened this issue 1 year ago • 8 comments

Related discussion in https://github.com/JuliaData/DataFrames.jl/issues/2384

At least for the subset of Reals where counting unique values is unproblematic (e.g., integers, in contrast with floats), shouldn't this be reported when the user asks for it?

In abstractdataframes.jl:

if :nunique in stats
    if eltype(col) <: Real
        d[:nunique] = nothing
    else
        d[:nunique] = try length(Set(col)) catch end
    end
end
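
To make the consequence of that branch concrete, here is a minimal standalone sketch of the same rule using only Base. The function name `nunique_as_described` is hypothetical and is not part of DataFrames.jl; it only mirrors the logic quoted above.

```julia
# Hypothetical helper mirroring the quoted branch; not DataFrames.jl code.
function nunique_as_described(col::AbstractVector)
    if eltype(col) <: Real
        # Every Real element type is skipped, including Int and Bool.
        return nothing
    else
        # Count distinct values; fall back to `nothing` if hashing fails.
        return try
            length(Set(col))
        catch
            nothing
        end
    end
end

ints  = [1, 1, 2, 3]
bools = [true, false, true]
strs  = ["a", "b", "a"]

r_ints  = nunique_as_described(ints)
r_bools = nunique_as_described(bools)
r_strs  = nunique_as_described(strs)
```

Because `Bool <: Integer <: Real` in Julia, Boolean columns take the same `nothing` branch as integer and float columns, while string columns are counted.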

Jul 11 '22 20:07 eirikbrandsaas

We could add :nunique!, which would be the same as :nunique but always calculated. That is potentially expensive, which the ! would signal, so the user would have to choose it consciously.

This is the solution I see to avoid a breaking change. @nalimilan - what do you think?

This and https://github.com/JuliaData/DataFrames.jl/issues/3095 should both be easy changes, so if we want them we could put them in the 1.4 release?

Jul 11 '22 21:07 bkamins

Unfortunately, integers have the same problem as floats: if you have many unique values it's super slow. An exception is small integer types (up to Int16) but these are not the most common case anyway.
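
One way to read the exception for small integer types is that their value range is small enough to track with a fixed-size table instead of a hash set. The sketch below is an illustration under that assumption, not how DataFrames.jl implements anything; `ndistinct_small` is a hypothetical name.

```julia
# Hypothetical sketch: count distinct values of a small integer type
# with a fixed lookup table (at most 65536 entries) instead of a Set.
const SmallInt = Union{Int8, UInt8, Int16, UInt16}

function ndistinct_small(col::AbstractVector{T}) where {T<:SmallInt}
    # One flag per representable value of T.
    seen = falses(Int(typemax(T)) - Int(typemin(T)) + 1)
    offset = 1 - Int(typemin(T))   # maps typemin(T) to index 1
    n = 0
    for x in col
        i = Int(x) + offset
        if !seen[i]
            seen[i] = true
            n += 1
        end
    end
    return n
end

a = Int16[1, 1, -5, 7, typemin(Int16), typemax(Int16)]
b = UInt8[0, 255, 0]

na = ndistinct_small(a)
nb = ndistinct_small(b)
```

For Int32 or Int64 the table would be far too large, which is why those types fall back to hashing and share the cost profile of floats.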

Adding a new name would be OK though. I'm not sure :nunique! is really appropriate, as ! generally indicates an in-place operation, but any name could do (e.g. :nuniqueall).

@eirikbrandsaas What's your use case? I assume you have few unique values in your integer columns?

Jul 12 '22 07:07 nalimilan

:nuniqueall is fine with me.

Jul 12 '22 07:07 bkamins

@nalimilan In this case I had data with about 500 rows, where I knew that for some columns ~99% of the values would be the same. But I wanted to check, and describe(df, :nunique) sounded like the natural way to do it.

More generally, I find this useful when working with messy data. I often rely on the information Stata provides when you "codebook" a variable, including the number of unique values (say, identification numbers in a panel data set), when working with messy survey data.

More philosophically, I don't see the problem with a slow calculation. If somebody asks for the number of unique elements, I don't see why the software shouldn't give it to them just because it's slow. Any statistical operation is slow in the limit.

That said, as evident from this amazing package, you know what you're doing!

Jul 12 '22 12:07 eirikbrandsaas

Sorry to bump it again, but three more comments:

It doesn't work for Boolean either:

df_tmp = DataFrame(boolean = Bool.(round.(Int,rand(20))))
describe(df_tmp,:nunique)
1×2 DataFrame
 Row │ variable  nunique 
     │ Symbol    Nothing 
─────┼───────────────────
   1 │ boolean

The more I think about it, the less I get the argument that this operation can be costly. I can do groupby(df, :x) when :x is a Float, which is also costly. In general, it's not clear why that is "your business", so to speak, especially as long as nunique isn't part of the default describe(df) call. I don't know anything about sorting algorithms, but is it really that much cheaper to find the unique elements of a string column?

 describe(DataFrame(boolean = Bool.(round.(Int,rand(20))),str =string.(rand(20))),:nunique)
2×2 DataFrame
 Row │ variable  nunique 
     │ Symbol    Union…  
─────┼───────────────────
   1 │ boolean           
   2 │ str       20

A third use case: papers sometimes report summary statistics that include the number of unique elements of each variable. I can get all the other statistics through describe() except this one.
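
A possible workaround for this use case, assuming DataFrames.jl 1.x, where describe accepts custom statistics as `function => name` pairs: pass a counting function directly. The column name `ndistinct` and the example data are made up for illustration.

```julia
using DataFrames

# Example data (made up): one integer column and one string column.
df = DataFrame(id = [1, 1, 2, 3], label = ["a", "b", "a", "b"])

# Custom statistic: count distinct values regardless of element type.
# Assumes describe accepts `function => name` pairs (DataFrames.jl 1.x).
res = describe(df, (col -> length(Set(col))) => :ndistinct)
```

The result has one row per column of `df`, with the distinct-value count in `res.ndistinct`; the cost is whatever hashing every value costs, which is the concern raised earlier in the thread.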

Jul 21 '22 13:07 eirikbrandsaas

Counting unique values for strings is more expensive than for floats.
The original reasoning was that very wide tables rarely have many string columns; rather, they have many float columns, and that is the problematic case.
As you can see in this issue's metadata, this feature has been scheduled for the 1.4 release, so it will be added soon. The current plan is to call what you ask for :nuniqueall to avoid a breaking change.

Jul 21 '22 17:07 bkamins

Perfect, thanks!

Jul 21 '22 20:07 eirikbrandsaas

FWIW, Base (unfortunately, imho) uses the word unique, but in general I think the right word for this concept is not unique but distinct.

Jul 22 '22 19:07 jariji
