DataArrays.jl

Semantics of unique and levels

Open johnmyleswhite opened this issue 11 years ago • 8 comments

The semantics of unique and levels are a mess at the moment, which causes the error Andreas sees in DataFrames:

julia> using DataArrays

julia> da = @data([1, 2, NA])
3-element DataArray{Int64,1}:
 1  
 2  
  NA

julia> pda = @pdata([1, 2, NA])
3-element PooledDataArray{Int64,Uint32,1}:
 1  
 2  
  NA

julia> unique(da)
3-element DataArray{Int64,1}:
  NA
 2  
 1  

julia> levels(da)
3-element DataArray{Int64,1}:
  NA
 2  
 1  

julia> unique(pda)
3-element DataArray{Int64,1}:
 1  
 2  
  NA

julia> levels(pda)
2-element Array{Int64,1}:
 1
 2

My preference is that unique should always return a DataArray of the same element type containing all the unique values (including NA), whereas levels should always return an Array of the same element type containing only the non-NA unique values. We already do this for PDAs, but not for DAs.

johnmyleswhite, Dec 07 '13 16:12
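
The semantics proposed above can be sketched in plain Julia. This is only an illustration, not the DataArrays implementation: it assumes a recent Julia (1.x), uses Base's `missing` as a stand-in for `NA`, and the function names `unique_with_na` and `levels_without_na` are hypothetical.

```julia
# Hypothetical helpers; `missing` stands in for DataArrays' `NA`.

# All distinct values, the missing marker included, in order of first appearance.
unique_with_na(xs) = unique(xs)

# Distinct non-missing values only, returned as a plain Vector
# whose element type no longer admits `missing`.
levels_without_na(xs) = unique(collect(skipmissing(xs)))

xs = [1, 2, missing, 2, 1]

u = unique_with_na(xs)     # three entries: 1, 2 and missing
l = levels_without_na(xs)  # two entries: 1 and 2, element type Int
```

The point of the split is that callers of the second function never have to handle the missing marker, while callers of the first see every distinct value in the data.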

In c8e1653eea20e5dc45f76f939bec994d73311ab1, I implemented the semantics proposed above. If people are happy with that, this should be done.

johnmyleswhite, Dec 08 '13 04:12

Thoughts on this, @simonster?

johnmyleswhite, Dec 08 '13 16:12

This seems right given that unique has isequal semantics. I wonder whether unique(da; skipna=true) would be more discoverable than levels, but then we'd have to return a DataArray and not an Array for type stability since we don't get type inference for kwargs.

simonster, Dec 08 '13 19:12
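
The type-stability concern with a `skipna` keyword can be sketched as follows. This is an illustration under assumptions, not code from DataArrays: it uses Base's `missing` in place of `NA`, and both function names are hypothetical.

```julia
# Hypothetical: the container type of the result depends on the *value*
# of the keyword, so it cannot be determined from the argument types alone.
function unique_unstable(xs::AbstractVector; skipna::Bool=false)
    skipna ? unique(collect(skipmissing(xs))) : unique(xs)
end

# Hypothetical: both branches return the same kind of container
# (one that can still hold `missing`), whatever the keyword's value.
function unique_stable(xs::AbstractVector; skipna::Bool=false)
    out = unique(xs)
    skipna ? filter(!ismissing, out) : out
end

xs = [1, 2, missing, 2]

a = unique_unstable(xs; skipna=true)  # Vector{Int}
b = unique_unstable(xs)               # Vector{Union{Missing, Int}}
c = unique_stable(xs; skipna=true)    # Vector{Union{Missing, Int}}
d = unique_stable(xs)                 # Vector{Union{Missing, Int}}
```

The stable variant pays for its predictability by returning a missing-capable container even when nothing missing remains, which is the trade-off described in the comment above.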

I like that idea, although the absence of type specialization is frustrating.

johnmyleswhite, Dec 09 '13 03:12

I find it problematic that unique(pda) gives all levels, even those that don't actually exist in the array. This is a problem when using unique on a sub-DataFrame with PDA columns. I currently have to work around it by converting to DataArray before calling unique.

HarlanH, Jan 02 '14 21:01

Agreed. Even for levels, it is often practical to automatically drop those that do not appear in a subset (though that's yet another issue).

nalimilan, Jan 02 '14 21:01

I definitely think that levels should just return the pool, as it currently does (and as R does), but that unique should actually go through the effort of checking which values are present. At some level, levels should be the domain of the array. (See a bunch of very old discussion about metadata that could be associated with columns/DataArrays, somewhere in the DataFrames issue backlog...)

HarlanH, Jan 02 '14 22:01
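
The distinction between the pool (the array's domain) and the values actually present can be sketched with a toy pooled vector. This is a minimal illustration, not the PooledDataArray implementation: the `TinyPooled` type, its field layout (a ref of 0 marking NA), and both function names are assumptions, and `missing` stands in for `NA`.

```julia
# Hypothetical minimal pooled vector: `refs[i]` indexes into `pool`,
# and a ref of 0 marks a missing entry.
struct TinyPooled{T}
    refs::Vector{Int}
    pool::Vector{T}
end

# The declared domain: every pool entry, whether or not it occurs in the data.
pool_levels(p::TinyPooled) = copy(p.pool)

# The values actually present, found by scanning the refs.
function present_values(p::TinyPooled{T}) where {T}
    out = Union{Missing, T}[]
    for r in unique(p.refs)
        push!(out, r == 0 ? missing : p.pool[r])
    end
    return out
end

# "b" and "c" are in the pool but never referenced, as in a subset of a larger array.
p = TinyPooled([1, 1, 0], ["a", "b", "c"])

lv = pool_levels(p)     # "a", "b", "c"
pv = present_values(p)  # "a", missing
```

Scanning the refs costs a pass over the data, whereas returning the pool is constant-time apart from the copy, which is one reason the two operations are worth keeping separate.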

Yeah, this is how it works in R and that makes sense.

nalimilan, Jan 02 '14 22:01