DataArrays.jl Allow hashing DataArrays with NA values

Allow hashing DataArrays with NA values

Open alyst opened this issue 9 years ago • 6 comments

Without this DataFrames.nonunique() and friends do not work on frames with NA rows.

Jun 15 '15 13:06 alyst

This seems like a good strategy. I'm confused why the old approach wouldn't have worked, though -- it seems like it should just be too slow.

Jun 15 '15 14:06 johnmyleswhite

For me on v0.4 the old code was throwing InexactError or something like that when NAs were hashed.

Jun 15 '15 14:06 alyst

I'm going to hold off merging for a bit to give others a chance to review, but the CI failure seems unrelated and this seems good to go.

Jun 15 '15 14:06 johnmyleswhite

I've just thought it might be better to use findnext(BitVector) to skip NAs. I can resubmit an improved version.

Jun 15 '15 14:06 alyst

Sounds good. I would do some profiling to make sure that it's worth the effort; for the almost no-NA case I imagine it will be meaningfully slower to use findnext.

Jun 15 '15 14:06 johnmyleswhite

OK, I've replaced the PR with the findnext() version. Both approaches should do more or less the same bit magic, so it's hard to say what would happen in the average "dense" case, but for "sparse" case findnext() should be faster.

Jun 15 '15 14:06 alyst

DataArrays.jl DataArrays.jl copied to clipboard

Allow hashing DataArrays with NA values

DataArrays.jl
DataArrays.jl copied to clipboard