PCG
PCG copied to clipboard
Efficient Character Representations
The problem
Currently our character data is represented as follows:
newtype ContinuousCharacter = CC (ExtendedReal, ExtendedReal)
-- where for the sake of completeness ExtendedReal is defined as:
newtype ExtendedReal = Cost Double
newtype StaticCharacter = SC BitVector
data DynamicCharacter
= Missing {-# UNPACK #-} !Word
| DC {-# UNPACK #-} !BitMatrix
Our bit vectors and bit matrices are based around @recursion-ninja's bv-little library and so these are represented as follows:
data BitMatrix
= BitMatrix {-# UNPACK #-} !Int {-# UNPACK #-} !BitVector
data BitVector
= BV
{ dim :: {-# UNPACK #-} !Word
, nat :: !Natural
}
Here the arbitrary precision natural number is represented in Haskell in accordance with the gmp library. This works extremely well for arbitrary alphabet sizes encoded as arbitrary length bit vectors but is space and time inefficient for some typical inputs where we know the precise alphabet size and it is small. For example, if we have DNA characters then these can be stored in a single Word8
. This then has the added benefit that a block of such characters can be represented by an unboxed vector.
In order to measure this I took an input of the form:
input = fromList $ take n (cycle [1,2,3,4])
and wrote a fitch style operation (identical to the one we use on the Haskell side of our code):
fitch :: (Bits b) => b -> b -> (b, Word)
fitch l r
| popCount (l .&. r) > 0 = (l .&. r, 0)
| otherwise = (l .|. r, 1)
which I then zipped over blocks of bitvectors encoded with the two different inputs (the benchmarks are here: https://github.com/Boarders/Char-Benchmarks/blob/master/Char_Bench.hs). I used a boxed vector of natural numbers as a stand-in for our current arbitrary length bit vectors and unboxed vectors of both Word8 and Word64 for possible new encodings.
If I take n = 100
(that is to say, a character block of length 100) then the results are as follows:
benchmarking characters/fitch-unboxed-Word8
time 599.2 ns (593.6 ns .. 608.8 ns)
0.998 R² (0.997 R² .. 0.999 R²)
mean 616.6 ns (607.9 ns .. 627.8 ns)
std dev 31.77 ns (27.83 ns .. 36.50 ns)
variance introduced by outliers: 69% (severely inflated)
benchmarking characters/fitch-unboxed-Word64
time 777.8 ns (770.5 ns .. 786.1 ns)
0.999 R² (0.999 R² .. 1.000 R²)
mean 784.3 ns (776.1 ns .. 793.6 ns)
std dev 29.44 ns (22.43 ns .. 36.34 ns)
variance introduced by outliers: 53% (severely inflated)
benchmarking characters/fitch-boxed-natural
time 2.191 μs (2.141 μs .. 2.231 μs)
0.997 R² (0.996 R² .. 0.998 R²)
mean 2.111 μs (2.080 μs .. 2.147 μs)
std dev 112.9 ns (87.44 ns .. 146.2 ns)
variance introduced by outliers: 68% (severely inflated)
Taking n = 100000
gives the following:
benchmarking characters/fitch-unboxed-Word8
time 571.2 μs (567.8 μs .. 575.8 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 577.5 μs (573.3 μs .. 585.3 μs)
std dev 18.52 μs (12.56 μs .. 27.81 μs)
variance introduced by outliers: 23% (moderately inflated)
benchmarking characters/fitch-unboxed-Word64
time 727.1 μs (721.5 μs .. 732.8 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 721.3 μs (719.1 μs .. 723.9 μs)
std dev 7.784 μs (6.223 μs .. 10.63 μs)
benchmarking characters/fitch-boxed-natural
time 11.33 ms (11.23 ms .. 11.42 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 11.52 ms (11.46 ms .. 11.58 ms)
std dev 152.0 μs (123.4 μs .. 194.4 μs)
Note here that, from the perspective of these operations, boxed vectors of arbitrary precision nats do not scale linearly whereas it looks like that is the case (in this range) for unboxed vectors (this is probably due to whether the input can be stored in L1/L2 cache and various other mysteries of modern hardware).
This suggests there is a lot to be gained by encoding shorter alphabets in fixed width types.
Proposal
I think we should take the following steps:
- Introduce new character types of the form:
newtype StaticCharacter8 = StaticCharacter8 Word8
newtype StaticCharacter64 = StaticCharacter64 Word64
newtype StaticCharacter128 = StaticCharacter128 Word128 -- we should see if these
newtype StaticCharacter256 = StaticCharacter256 Word2256 -- pay their way in performance
data DynamicCharacter8
= Missing {-# UNPACK #-} !Word
| DC8 {-# UNPACK #-} !(Unboxed.Vector Word8)
-- etc.
This will require the character blocks can easily be extensible with new types which the design in the new-graph-representation
branch allows for. We can also make life easier by having lenses onto the various components of static and dynamic characters from a character block. That should make it seem as though not much has changed from the perspective of a high level user of the character sequence type.
-
For the continuous character I don't think there are many gains to be had and these benchmarks don't measure anything about those types but it is perhaps worth changing the tuple to a record of
ExtendedReal
s which is strict in each field as this is usually an easy performance gain. -
Using fixed width types opens up the possibility that a character block can use an unboxed vector. We can do that for all of the aforementioned characters and also we can write an
Unbox
instance for the continuous character type (which will, probably in turn, require such an instance to be written for an ExtendedReal`). This should offer good performance gains. -
Above I have listed types
Word128
andWord256
. Most modern CPUs have single instructions for performing operations on 128 or 256 wide registers. For example I wrote (and tested) the following C code:
uint64_t* mmfitch (int len, const uint64_t* bv1, const uint64_t* bv2){
uint64_t* res __attribute__((aligned (16))) = malloc(len*8);
uint64_t* start = res;
for (int i = 0; i < (len + 1) / 2; i++){
__m128i lvals = _mm_loadu_si128((__m128i*) bv1);
__m128i rvals = _mm_loadu_si128((__m128i*) bv2);
__m128i __mres;
if (!_mm_testz_si128(lvals, rvals)){
// if there is an intersection then return it.
__mres = _mm_and_si128(lvals, rvals);
}
else {
// otherwise take the union.
__mres = _mm_or_si128 (lvals, rvals);
}
_mm_store_si128((__m128i*) res, __mres);
bv1 +=2;
bv2 +=2;
res +=2;
}
return start;
}
This is elaborate as a result of the intel intrinsics, but if you squint it is just the usual fitch-style operation. When I have the time I am going test, via ffi, what the performance is like for this when called from Haskell. I should note that such SIMD instructions will eventually be added to the native code generator of Haskell and as such if we were to use any ffi code it would be (hopefully) short-lived. We will need to experiment with using this via ffi in Haskell to see whether it is actually worthwhile (I haven't done this yet as this function expects 16 aligned inputs and so some finessing needs to take place on the Haskell side beyond the ffi finessing).
Downsides
This leads to multiple different character representations which will lead to both more code with similar (but subtlety different) operations. We will probably have to write boxed and unboxed implementations for several operations (for example a fitch-style operation). This therefore will lead to more code and probably a large-ish re-factor (though i don't envision it being too painful). There are probably others I haven't thought about.
(Other) Upsides
Beyond those mentioned already in terms of performance, I think this type of project will lead to packed characters being much easier to add to PCG. On top of this, if the SIMD 128 bit operations have good performance and we can figure out the right sort of bit masking (also using intel intrinsics) then this can lead to potentially 16x single core parallelism on characters that fit in a single byte (though that doesn't take into account the masking cost). I think it is well worth exploring this possibility.