DataArrays.jl icon indicating copy to clipboard operation
DataArrays.jl copied to clipboard

set operations for DataArrays

Open WestleyArgentum opened this issue 10 years ago • 4 comments

It would be nice to have versions of the standard set operations (union, setdiff, intersect, etc) defined for DataArrays so that dealing with NAs was easier. Right now the functions try to output regular arrays of the same type as the DataArray, and run into problems when they can't convert NA to said type:

julia> setdiff(@data([1,2,3,4]), @data([3,4,5,NA]))
ERROR: `convert` has no method matching convert(::Type{Int64}, ::NAtype)
 in setindex! at dict.jl:545
 in push! at set.jl:18
 in union! at set.jl:23
 in setdiff at array.jl:1358

WestleyArgentum avatar Nov 15 '14 19:11 WestleyArgentum

To fit with how NA is treated conceptually throughout the package, when either argument has any NAs, setdiff pretty much has to throw an error. (Set(DataVector[..., NA, ...]) is unknown, and while defining Set(DataVector) would be convenient, I think the principle that has guided design here and in DataFrames favors forcing explicit handling of NAs to avoid introducing subtle bugs).

Possible options -- friendlier error message, or a more convenient syntax than setdiff(each_dropna(dv1), each_dropna(dv2))

garborg avatar Nov 15 '14 21:11 garborg

Probably setdiff(dv1, dv2, skipna=true) would be nicer and more consistent with e.g. mapreduce?

nalimilan avatar Nov 15 '14 22:11 nalimilan

That's nicer, for sure, and defining the three methods for each function would allow for tuning. On the other hand, it sounds like the quantity of methods that need to be reimplemented has slowed down development of data/nullable/categorical/etc. container types, so maybe slicker iterators would give more mileage: e.g.

setdiff(v, dv, skipna = true) # =>
setdiff(v, skipna(dv)) # or
setdiff(v, SkipNA(dv))

setdiff(dv1, dv2, skipna = true) # =>
setdiff(skipna(dv1), skipna(dv2)) # or
setdiff(SkipNA(dv1), SkipNA(dv2))

garborg avatar Nov 15 '14 22:11 garborg

One thing I'd like to do now that Nullable is in Base is to define iterators like skipnulls that live outside of DataArrays (since you might need to apply them to Array{Any}, for example) and lean on those iterators more heavily.

In general, we could do a much better job of making clear when functions can accept arbitrary iterators and when they need a fully materialized input to work.

johnmyleswhite avatar Nov 17 '14 15:11 johnmyleswhite