DataArrays.jl icon indicating copy to clipboard operation
DataArrays.jl copied to clipboard

Allow hashing DataArrays with NA values

Open alyst opened this issue 9 years ago • 6 comments

Without this DataFrames.nonunique() and friends do not work on frames with NA rows.

alyst avatar Jun 15 '15 13:06 alyst

This seems like a good strategy. I'm confused why the old approach wouldn't have worked, though -- it seems like it should just be too slow.

johnmyleswhite avatar Jun 15 '15 14:06 johnmyleswhite

For me on v0.4 the old code was throwing InexactError or something like that when NAs were hashed.

alyst avatar Jun 15 '15 14:06 alyst

I'm going to hold off merging for a bit to give others a chance to review, but the CI failure seems unrelated and this seems good to go.

johnmyleswhite avatar Jun 15 '15 14:06 johnmyleswhite

I've just thought it might be better to use findnext(BitVector) to skip NAs. I can resubmit an improved version.

alyst avatar Jun 15 '15 14:06 alyst

Sounds good. I would do some profiling to make sure that it's worth the effort; for the almost no-NA case I imagine it will be meaningfully slower to use findnext.

johnmyleswhite avatar Jun 15 '15 14:06 johnmyleswhite

OK, I've replaced the PR with the findnext() version. Both approaches should do more or less the same bit magic, so it's hard to say what would happen in the average "dense" case, but for "sparse" case findnext() should be faster.

alyst avatar Jun 15 '15 14:06 alyst