Memory-efficient bitset-based BooleanSeries
Representing BooleanSeries internally as a bitset instead of boolean[] results in a very compact data structure (1/8 of byte vs 1 byte). Access is probably slower , but we can improve it by replacing the standard Java BitSet with our own simpler immutable bit set (long[] internally)
@stariy95 , thanks for the PR. I merged it, and actually replaced BoolAccum with BitSetAccum (so BooleanSeries are partially switched to BitSet). A couple of things we will need to improve to complete the switch:
-
See TODO in
BoolAccum.toSeries(). Instead of usingjava.util.BitSetand then cloning (a potentially huge) long array into our own BitSet, let's implement the Accum to work directly on along[](with transparent resizing like we do in ObjectAccum, etc.) -
Maybe we can create a simple API to build
BooleanBitsetSeriesof a known size bypassing the accum. See for instanceLongSeries.eq(..)method. Replacingboolean[] datawith aBoolAccumslows things downs 3.5x (partially due to the accum inefficiency above, but possibly also due to the CPU array optimizations that are not possible with an accum). Would be great to work directly with thelong[]and createBooleanBitsetSeriesfrom it. I guess we can do that already with direct bit arithmetic, but maybe we don't even need an internalBitSet.BooleanBitsetSerieswould become the bit set.
@andrus Here are some experiments with a direct creation of the BooleanBitsetSeries: https://github.com/dflib/dflib/pull/366
Not sure about naming, so it's marked as a draft for now.
Thanks, BTW just found some numbers for when we optimized for vectorization: https://github.com/dflib/dflib/issues/211 . Looks like the lambda version is still properly vectorized.