simdzone icon indicating copy to clipboard operation
simdzone copied to clipboard

Optimize (vectorize?) integer deserialization

Open k0ekk0ek opened this issue 2 years ago • 15 comments

Currently, converting numbers to binary is done the idiomatic way. Since integers (int8, int16 and int32) occur quite a lot in zone files, e.g. TTL, optimization may result in a nice peformance boost.

To tackle this, it's good to know that for some integer values, use of a symbolic value is allowed to. e.g. algorithm field in DS records. Also, TTL values are 32-bit integer fields, but the MSB must not be set.

k0ekk0ek avatar Feb 20 '23 11:02 k0ekk0ek

My thoughts on parsing integers start here: https://stackoverflow.com/a/74453028 I don't know if it actually makes anything faster.

aqrit avatar Jun 17 '23 13:06 aqrit

Thanks @aqrit! I'll be sure to give this a try down the road. Most RRs don't contain integer data (people tend to specify the TTL once), but while wrinting #68, I did see some types where this may prove very useful.

k0ekk0ek avatar Jun 19 '23 08:06 k0ekk0ek

@aqrit, looked at your posts (and mentioned posts/articles) quite a bit. I've changed the function to detect invalid characters too. This is a first stab at the problem, I've not benchmarked it. For simdzone, we know what type of integer (8 bit, 16 bit or 32 bit) we're going to parse, so chances are we can do much better.

static int parse_integer(const char *buffer, size_t length, uint64_t *number)
{
  uint64_t digits, bad_chars;
  static const uint8_t shift[] =
    { 0,  56, 48, 40, 32, 24, 16, 8,  0,  0,  0,  0,  0,  0,  0,  0 };

  length &= 0xfull;

  memcpy(&digits, buffer, sizeof(digits)); // assumes little-endian
  // flip 0x30 (ASCII 0 (0x30) - 9 (0x39)) bits, detect non-digits
  digits ^= 0x3030303030303030ull;
  // shift off non-digits
  digits <<= shift[length];

  bad_chars = (digits | (digits + 0x0606060606060606ull)) & 0xf0f0f0f0f0f0f0f0ull;

  digits = (digits & 0x0f0f0f0f0f0f0f0fllu) * 2561 >> 8;
  digits = (digits & 0x00ff00ff00ff00ffllu) * 6553601 >> 16;
  digits = (digits & 0x0000ffff0000ffffllu) * 42949672960001 >> 32;

  *number = digits;

  return (bad_chars == 0);
}

int main(int argc, char *argv[])
{
  char buffer[16] = { 0 }; // padding guaranteed in simdzone
  size_t length;
  if (argc == 2)
    length = strlen(argv[1]);
  else
    length = 0;

  memcpy(buffer, argv[1], length&0xfull);

  uint64_t number;

  if (parse_integer(buffer, length, &number)) {
    fprintf(stdout, "number: %llu\n", number);
    return 0;
  } else {
    fprintf(stderr, "Invalid number\n");
    return -1;
  }
}

k0ekk0ek avatar Nov 22 '23 09:11 k0ekk0ek

I found that with validation required, it's incredibly hard to beat the naive implementation. I focused on int8 (maximum of 255) first. What I found consistently faster is:

static int parse_int8_swar(const char *str, size_t len, uint8_t *num)
{
  union { uint8_t as_str[4]; uint32_t as_int; } digits;

  memcpy(&digits.as_int, str, sizeof(digits));
  // flip 0x30, detect non-digits
  digits.as_int ^= 0x30303030lu;
  // shift off trash bytes
  digits.as_int <<= (4 - (len & 0x3)) * 8;

  const uint32_t x = digits.as_str[1] * 10 + digits.as_str[2];
  const uint32_t y = x * 10 + digits.as_str[3];
  *num = (uint8_t)y;
  return (digits.as_int & 0xf0f0f0f0) == 0 && y < 256 && len && len < 4;
}

Where naive is:

__attribute__((noinline))
static int parse_int8_naive(const char *str, size_t len, uint8_t *num)
{
  uint32_t n = 0;

  for (size_t i=0, r=len&0x3; i < r; i++) {
    uint8_t d = str[i] - '0';
    if (d > 9)
      return 0;
    n = n * 10 + d;
  }

  *num = (uint8_t)n;
  return n < 256 && len && len < 4;
}

Don't know if it's worth the trouble. @aqrit, @lemire, any pointers?

k0ekk0ek avatar Nov 28 '23 10:11 k0ekk0ek

Here is my attempt:

https://lemire.me/blog/2023/11/28/parsing-8-bit-integers-quickly/

lemire avatar Nov 28 '23 20:11 lemire

(My tests suggest that you can definitively do much better than naive with SWAR.)

lemire avatar Nov 28 '23 20:11 lemire

Great stuff @lemire. Thanks!

k0ekk0ek avatar Nov 29 '23 08:11 k0ekk0ek

@k0ekk0ek I am tweaking it a bit since the initial version did not fully validate the input.

lemire avatar Nov 29 '23 13:11 lemire

I forgot to add 0x06060606 after the initial XOR back to check for character greater than '9'. That may not cover all scenarios though, i.e. when the initial value is 11001111, then it'd overflow.

k0ekk0ek avatar Nov 29 '23 13:11 k0ekk0ek

I recommend the following...


int parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) {
  if(len == 0 || len > 3) { return 0; }
  union { uint8_t as_str[4]; uint32_t as_int; } digits;
  memcpy(&digits.as_int, str, sizeof(digits));
  digits.as_int ^= 0x30303030lu;
  digits.as_int <<= ((4 - len) * 8);
  uint32_t all_digits = ((digits.as_int & 0xf0f0f0f0) |
                         ((0x76767676 + digits.as_int) & 0x80808080)) == 0;
  *num = (uint8_t)((UINT64_C(0x640a0100) * digits.as_int) >> 32);
  return all_digits & ((__builtin_bswap32(digits.as_int) <= 0x020505));
}

I think it covers everything.

lemire avatar Nov 29 '23 14:11 lemire

Slightly more efficient...

int parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) {
  if(len == 0 || len > 3) { return 0; }
  union {
    uint8_t as_str[4];
    uint32_t as_int;
  } digits;

  memcpy(&digits.as_int, str, sizeof(digits));
  // flip 0x30, detect non-digits
  digits.as_int ^= 0x30303030lu;
  // shift off trash bytes, technically undefined behaviour when ((4 - len) * 8) is not
  // in [0,32) prior to C17 / C++14.
  digits.as_int <<= ((4 - len) * 8);
  // check all digits
  uint32_t all_digits = ((digits.as_int | (0x06060606 + digits.as_int)) & 0xF0F0F0F0) == 0;
  *num = (uint8_t)((0x640a01 * digits.as_int) >> 24);
  return all_digits & ((__builtin_bswap32(digits.as_int) <= 0x020505));
}

This is now squarely 2x faster than naive on random inputs, and still measurably faster on predictable inputs.

It is likely practical.

lemire avatar Nov 29 '23 19:11 lemire

Thanks @lemire! It's amazing how much time one can spend on this stuff.

I have been trying to workaround the shift (note that parse_uint8_fastswar_bob suffers from the same problem in case len is greater). e.g. ((len & 3) ^ 3) + 1) * 8, (0x0d << (len & 3)) & 0x18) and a shift table. However, they are all slower than what's currently there (dependency chain?). Maybe there's an alternative to get the bytes in the "right" location. For simdzone, a len of 0 (zero) is not really a problem as in that case there won't be a token.

~~There's one case we haven't covered though. Leading zeroes. i.e. 0 is allowed, but 01 is not.~~ Leading zeroes actually are allowed. At least, BIND and Knot (and the current parser in NSD), do.

I'll give this some more thought at a later stage.

k0ekk0ek avatar Nov 30 '23 10:11 k0ekk0ek

Or, maybe it's just better solved in SIMD? Trying to get things to go "fast" for the fallback parser may not be the right choice(?) From a practical perspective, users won't notice as no-one is using CPU's that don't support SIMD for real work. Anyway, I'll have to polish the current code for the first release, I'll work on this more afterwards. (I have some vacation coming up :slightly_smiling_face:)

k0ekk0ek avatar Nov 30 '23 11:11 k0ekk0ek

There's one case we haven't covered though. Leading zeroes. i.e. 0 is allowed, but 01 is not.

Note that this is true of the naïve version as well.

lemire avatar Nov 30 '23 13:11 lemire

Random blog post I came across. May be worth investigating at some point.

k0ekk0ek avatar Feb 20 '24 13:02 k0ekk0ek