DataArrays.jl
DataArrays.jl copied to clipboard
set operations for DataArrays
It would be nice to have versions of the standard set operations (union, setdiff, intersect, etc) defined for DataArrays so that dealing with NAs was easier. Right now the functions try to output regular arrays of the same type as the DataArray, and run into problems when they can't convert NA to said type:
julia> setdiff(@data([1,2,3,4]), @data([3,4,5,NA]))
ERROR: `convert` has no method matching convert(::Type{Int64}, ::NAtype)
in setindex! at dict.jl:545
in push! at set.jl:18
in union! at set.jl:23
in setdiff at array.jl:1358
To fit with how NA
is treated conceptually throughout the package, when either argument has any NA
s, setdiff pretty much has to throw an error. (Set(DataVector[..., NA, ...])
is unknown, and while defining Set(DataVector) would be convenient, I think the principle that has guided design here and in DataFrames favors forcing explicit handling of NAs to avoid introducing subtle bugs).
Possible options -- friendlier error message, or a more convenient syntax than setdiff(each_dropna(dv1), each_dropna(dv2))
Probably setdiff(dv1, dv2, skipna=true)
would be nicer and more consistent with e.g. mapreduce
?
That's nicer, for sure, and defining the three methods for each function would allow for tuning. On the other hand, it sounds like the quantity of methods that need to be reimplemented has slowed down development of data/nullable/categorical/etc. container types, so maybe slicker iterators would give more mileage: e.g.
setdiff(v, dv, skipna = true) # =>
setdiff(v, skipna(dv)) # or
setdiff(v, SkipNA(dv))
setdiff(dv1, dv2, skipna = true) # =>
setdiff(skipna(dv1), skipna(dv2)) # or
setdiff(SkipNA(dv1), SkipNA(dv2))
One thing I'd like to do now that Nullable is in Base is to define iterators like skipnulls
that live outside of DataArrays (since you might need to apply them to Array{Any}
, for example) and lean on those iterators more heavily.
In general, we could do a much better job of making clear when functions can accept arbitrary iterators and when they need a fully materialized input to work.