PCG icon indicating copy to clipboard operation
PCG copied to clipboard

Efficient Character Representations

Open Boarders opened this issue 4 years ago • 5 comments

The problem

Currently our character data is represented as follows:

newtype ContinuousCharacter = CC (ExtendedReal, ExtendedReal)
-- where for the sake of completeness ExtendedReal is defined as:
newtype ExtendedReal = Cost Double

newtype StaticCharacter = SC BitVector

data  DynamicCharacter
    = Missing {-# UNPACK #-} !Word
    | DC      {-# UNPACK #-} !BitMatrix

Our bit vectors and bit matrices are based around @recursion-ninja's bv-little library and so these are represented as follows:

data  BitMatrix
    = BitMatrix {-# UNPACK #-} !Int {-# UNPACK #-} !BitVector

data  BitVector
    = BV
    { dim :: {-# UNPACK #-} !Word
    , nat :: !Natural
    }

Here the arbitrary precision natural number is represented in Haskell in accordance with the gmp library. This works extremely well for arbitrary alphabet sizes encoded as arbitrary length bit vectors but is space and time inefficient for some typical inputs where we know the precise alphabet size and it is small. For example, if we have DNA characters then these can be stored in a single Word8. This then has the added benefit that a block of such characters can be represented by an unboxed vector.

In order to measure this I took an input of the form:

input = fromList $ take n (cycle [1,2,3,4])

and wrote a fitch style operation (identical to the one we use on the Haskell side of our code):

fitch :: (Bits b) => b -> b -> (b, Word)
fitch l r
  | popCount (l .&. r) > 0 = (l .&. r, 0)
  | otherwise              = (l .|. r, 1)

which I then zipped over blocks of bitvectors encoded with the two different inputs (the benchmarks are here: https://github.com/Boarders/Char-Benchmarks/blob/master/Char_Bench.hs). I used a boxed vector of natural numbers as a stand-in for our current arbitrary length bit vectors and unboxed vectors of both Word8 and Word64 for possible new encodings.

If I take n = 100 (that is to say, a character block of length 100) then the results are as follows:

benchmarking characters/fitch-unboxed-Word8
time                 599.2 ns   (593.6 ns .. 608.8 ns)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 616.6 ns   (607.9 ns .. 627.8 ns)
std dev              31.77 ns   (27.83 ns .. 36.50 ns)
variance introduced by outliers: 69% (severely inflated)

benchmarking characters/fitch-unboxed-Word64
time                 777.8 ns   (770.5 ns .. 786.1 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 784.3 ns   (776.1 ns .. 793.6 ns)
std dev              29.44 ns   (22.43 ns .. 36.34 ns)
variance introduced by outliers: 53% (severely inflated)

benchmarking characters/fitch-boxed-natural
time                 2.191 μs   (2.141 μs .. 2.231 μs)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 2.111 μs   (2.080 μs .. 2.147 μs)
std dev              112.9 ns   (87.44 ns .. 146.2 ns)
variance introduced by outliers: 68% (severely inflated)

Taking n = 100000 gives the following:

benchmarking characters/fitch-unboxed-Word8
time                 571.2 μs   (567.8 μs .. 575.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 577.5 μs   (573.3 μs .. 585.3 μs)
std dev              18.52 μs   (12.56 μs .. 27.81 μs)
variance introduced by outliers: 23% (moderately inflated)

benchmarking characters/fitch-unboxed-Word64
time                 727.1 μs   (721.5 μs .. 732.8 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 721.3 μs   (719.1 μs .. 723.9 μs)
std dev              7.784 μs   (6.223 μs .. 10.63 μs)

benchmarking characters/fitch-boxed-natural
time                 11.33 ms   (11.23 ms .. 11.42 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 11.52 ms   (11.46 ms .. 11.58 ms)
std dev              152.0 μs   (123.4 μs .. 194.4 μs)

Note here that, from the perspective of these operations, boxed vectors of arbitrary precision nats do not scale linearly whereas it looks like that is the case (in this range) for unboxed vectors (this is probably due to whether the input can be stored in L1/L2 cache and various other mysteries of modern hardware).

This suggests there is a lot to be gained by encoding shorter alphabets in fixed width types.

Proposal

I think we should take the following steps:

  • Introduce new character types of the form:
newtype StaticCharacter8  = StaticCharacter8 Word8
newtype StaticCharacter64 = StaticCharacter64 Word64
newtype StaticCharacter128 = StaticCharacter128 Word128   -- we should see if these
newtype StaticCharacter256 = StaticCharacter256 Word2256 -- pay their way in performance


data  DynamicCharacter8
    = Missing {-# UNPACK #-} !Word
    | DC8     {-# UNPACK #-} !(Unboxed.Vector Word8)
-- etc.

This will require the character blocks can easily be extensible with new types which the design in the new-graph-representation branch allows for. We can also make life easier by having lenses onto the various components of static and dynamic characters from a character block. That should make it seem as though not much has changed from the perspective of a high level user of the character sequence type.

  • For the continuous character I don't think there are many gains to be had and these benchmarks don't measure anything about those types but it is perhaps worth changing the tuple to a record of ExtendedReals which is strict in each field as this is usually an easy performance gain.

  • Using fixed width types opens up the possibility that a character block can use an unboxed vector. We can do that for all of the aforementioned characters and also we can write an Unbox instance for the continuous character type (which will, probably in turn, require such an instance to be written for an ExtendedReal`). This should offer good performance gains.

  • Above I have listed types Word128 and Word256. Most modern CPUs have single instructions for performing operations on 128 or 256 wide registers. For example I wrote (and tested) the following C code:

uint64_t* mmfitch (int len, const uint64_t* bv1, const uint64_t* bv2){
  uint64_t* res  __attribute__((aligned (16))) = malloc(len*8);
  uint64_t* start = res;
  for (int i = 0; i < (len + 1) / 2; i++){
    __m128i lvals = _mm_loadu_si128((__m128i*) bv1);
    __m128i rvals = _mm_loadu_si128((__m128i*) bv2);
    __m128i __mres;
    if (!_mm_testz_si128(lvals, rvals)){
        // if there is an intersection then return it.
      __mres = _mm_and_si128(lvals, rvals);
      }
    else {
      // otherwise take the union.
      __mres = _mm_or_si128 (lvals, rvals);
    }
    _mm_store_si128((__m128i*) res, __mres);
    bv1 +=2;
    bv2 +=2;
    res +=2;
  }
  return start;
}

This is elaborate as a result of the intel intrinsics, but if you squint it is just the usual fitch-style operation. When I have the time I am going test, via ffi, what the performance is like for this when called from Haskell. I should note that such SIMD instructions will eventually be added to the native code generator of Haskell and as such if we were to use any ffi code it would be (hopefully) short-lived. We will need to experiment with using this via ffi in Haskell to see whether it is actually worthwhile (I haven't done this yet as this function expects 16 aligned inputs and so some finessing needs to take place on the Haskell side beyond the ffi finessing).

Downsides

This leads to multiple different character representations which will lead to both more code with similar (but subtlety different) operations. We will probably have to write boxed and unboxed implementations for several operations (for example a fitch-style operation). This therefore will lead to more code and probably a large-ish re-factor (though i don't envision it being too painful). There are probably others I haven't thought about.

(Other) Upsides

Beyond those mentioned already in terms of performance, I think this type of project will lead to packed characters being much easier to add to PCG. On top of this, if the SIMD 128 bit operations have good performance and we can figure out the right sort of bit masking (also using intel intrinsics) then this can lead to potentially 16x single core parallelism on characters that fit in a single byte (though that doesn't take into account the masking cost). I think it is well worth exploring this possibility.

Boarders avatar Mar 25 '20 15:03 Boarders