dflib icon indicating copy to clipboard operation
dflib copied to clipboard

Memory-efficient bitset-based BooleanSeries

Open andrus opened this issue 1 year ago • 3 comments

Representing BooleanSeries internally as a bitset instead of boolean[] results in a very compact data structure (1/8 of byte vs 1 byte). Access is probably slower , but we can improve it by replacing the standard Java BitSet with our own simpler immutable bit set (long[] internally)

andrus avatar Sep 22 '24 16:09 andrus

@stariy95 , thanks for the PR. I merged it, and actually replaced BoolAccum with BitSetAccum (so BooleanSeries are partially switched to BitSet). A couple of things we will need to improve to complete the switch:

  1. See TODO in BoolAccum.toSeries(). Instead of using java.util.BitSet and then cloning (a potentially huge) long array into our own BitSet, let's implement the Accum to work directly on a long[] (with transparent resizing like we do in ObjectAccum, etc.)

  2. Maybe we can create a simple API to build BooleanBitsetSeries of a known size bypassing the accum. See for instance LongSeries.eq(..) method. Replacing boolean[] data with a BoolAccum slows things downs 3.5x (partially due to the accum inefficiency above, but possibly also due to the CPU array optimizations that are not possible with an accum). Would be great to work directly with the long[] and create BooleanBitsetSeries from it. I guess we can do that already with direct bit arithmetic, but maybe we don't even need an internal BitSet. BooleanBitsetSeries would become the bit set.

andrus avatar Oct 20 '24 16:10 andrus

@andrus Here are some experiments with a direct creation of the BooleanBitsetSeries: https://github.com/dflib/dflib/pull/366 Not sure about naming, so it's marked as a draft for now.

stariy95 avatar Oct 21 '24 16:10 stariy95

Thanks, BTW just found some numbers for when we optimized for vectorization: https://github.com/dflib/dflib/issues/211 . Looks like the lambda version is still properly vectorized.

andrus avatar Oct 21 '24 16:10 andrus