DataArrays.jl
Semantics of unique and levels
The semantics of unique and levels are a mess at the moment, which causes the error Andreas sees in DataFrames:
julia> using DataArrays
julia> da = @data([1, 2, NA])
3-element DataArray{Int64,1}:
1
2
NA
julia> pda = @pdata([1, 2, NA])
3-element PooledDataArray{Int64,Uint32,1}:
1
2
NA
julia> unique(da)
3-element DataArray{Int64,1}:
NA
2
1
julia> levels(da)
3-element DataArray{Int64,1}:
NA
2
1
julia> unique(pda)
3-element DataArray{Int64,1}:
1
2
NA
julia> levels(pda)
2-element Array{Int64,1}:
1
2
My preference is that unique should always return a DataArray with the same element type containing all the unique values (including NA), whereas levels should always return an Array with the same element type containing only the non-NA unique values. We're doing this for PDAs, but not for DAs.
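To make the proposed split concrete, here is a rough sketch in plain Julia. It is not the DataArrays code: it assumes a Julia version with `missing` in Base (1.0 or later) and uses `missing` as a stand-in for `NA`. The names `unique_with_na` and `levels_without_na` are made up for illustration.

```julia
# Hypothetical helpers; `missing` stands in for DataArrays' `NA`.

# All distinct values, missing included, in order of first appearance.
# Base `unique` compares with `isequal`, so repeated missings collapse to one.
unique_with_na(xs::AbstractVector) = unique(xs)

# Distinct non-missing values as a plain Vector whose element type
# no longer allows missing.
levels_without_na(xs::AbstractVector) = unique(collect(skipmissing(xs)))

xs = [1, 2, missing, 2, missing]
u = unique_with_na(xs)      # element type Union{Missing, Int}
l = levels_without_na(xs)   # element type Int
```

Under these assumptions `u` keeps a single missing entry alongside 1 and 2, while `l` holds only 1 and 2 and can be typed as a plain `Vector{Int}`.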
In c8e1653eea20e5dc45f76f939bec994d73311ab1, I implemented the semantics proposed above. If people are happy with that, this should be done.
Thoughts on this, @simonster?
This seems right given that unique has isequal semantics. I wonder whether unique(da; skipna=true) would be more discoverable than levels, but then we'd have to return a DataArray and not an Array for type stability since we don't get type inference for kwargs.
I like that idea, although the absence of type specialization is frustrating.
I find it problematic that unique(pda) gives all levels, even those that don't actually exist in the array. This is a problem when using unique on a sub-DataFrame with PDA columns. I currently have to work around it by converting to DataArray before calling unique.
Agreed. Even for levels, it is often practical to automatically drop those that do not appear in a subset (though that's yet another issue).
I definitely think that levels should just return the pool, as it currently does (and R does), but that unique should actually go through the effort of checking. At some level, levels should be the domain of the array. (See a bunch of very old discussion about metadata that could be associated with columns/DataArrays, somewhere in the DataFrames issue backlog...)
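A toy illustration of that distinction, assuming a simplified pooled layout (integer refs into a pool, with 0 marking a missing value). `ToyPooled`, `toy_levels` and `toy_unique` are hypothetical names, and this is only a sketch of the idea, not how PooledDataArray is implemented.

```julia
# Hypothetical minimal pooled vector; not the PooledDataArray implementation.
struct ToyPooled{T}
    refs::Vector{Int}   # index into `pool`; 0 marks a missing value
    pool::Vector{T}     # declared levels, i.e. the domain of the array
end

# Levels: return the pool as-is, even if some levels are unused.
toy_levels(p::ToyPooled) = copy(p.pool)

# Unique: scan the refs and keep only the values that actually occur,
# in order of first appearance.
function toy_unique(p::ToyPooled{T}) where {T}
    seen = Set{Int}()
    out = Union{T,Missing}[]
    for r in p.refs
        r in seen && continue
        push!(seen, r)
        push!(out, r == 0 ? missing : p.pool[r])
    end
    return out
end

# A subset that only uses levels "a" and "c" of a three-level pool.
p = ToyPooled([1, 3, 1, 0], ["a", "b", "c"])
```

Here `toy_levels(p)` still reports all three levels, whereas `toy_unique(p)` reports only "a", "c" and the missing entry, which is the behaviour one would want on a sub-DataFrame.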
Yeah, this is how it works in R and that makes sense.