ssz icon indicating copy to clipboard operation
ssz copied to clipboard

Fast conversion from decimal to bytes

Open dapplion opened this issue 3 years ago • 8 comments

Lodestar spends significant CPU time converting uint types from decimal to bytes and back. For example, converting the balance tree to flat and back consist of doing 200_000 ops of those.

SSZ is the one hosting the code that does the conversions. This is one of the implementations that participate in this conversions.

https://github.com/ChainSafe/ssz/blob/be03a3398e9c13b4d6c609ba07dfba7bf2cfe2a6/src/types/basic/uint.ts#L68-L82

https://github.com/ChainSafe/ssz/blob/be03a3398e9c13b4d6c609ba07dfba7bf2cfe2a6/src/types/basic/uint.ts#L84-L98

We should investigate faster alternatives

dapplion avatar Aug 10 '21 14:08 dapplion

will investigate :+1:

g11tech avatar Aug 10 '21 16:08 g11tech

with https://github.com/ChainSafe/persistent-merkle-tree/pull/52, we can get deserialized number if it contains less than 32 bits (right now we have to convert h0..h7 to Uint8Array and from Uint8Array back to number)

  it.only("get deserialized number from h0", () => {
    const BeaconState = new ContainerType({
      fields: {
        slot: number64Type,
      }
    });
    for (let slot = 2 ** 31 - 100; slot <= 2 ** 31; slot++) {
      const tb = BeaconState.createTreeBackedFromStruct({slot});
      const node = tb.tree.getNode(BigInt(1));
      expect(Math.abs(node.h0)).to.be.equal(slot, "failed at slot " + slot);
    }
  });

it should work with slot, committee index etc. but not with balance

twoeths avatar Aug 11 '21 07:08 twoeths

I've tested using TypedArrays to convert numbers and it's actually slower than the current implementation in as-sha256

  utils
    ✓ hashObjectToByteArray 50023 times                                   407.5760 ops/s    2.453530 ms/op   x1.406       4073 runs   10.0 s
    ✓ byteArrayToHashObject 50023 times                                   401.1899 ops/s    2.492585 ms/op   x1.174       4009 runs   10.0 
/**
 * Pass 8 numbers in an object and set that to inputArray.
 * This function contains multiple same procedures but we intentionally
 * do it step by step to improve performance a bit.
 **/
export function hashObjectToByteArray(obj, byteArr, offset) {
  const uint32 = new Uint32Array(byteArr.buffer);

  uint32[0] = obj.h0;
  uint32[1] = obj.h1;
  uint32[2] = obj.h2;
  uint32[3] = obj.h3;
  uint32[4] = obj.h4;
  uint32[5] = obj.h5;
  uint32[6] = obj.h6;
  uint32[7] = obj.h7;
}

/**
 * Parse outputArray into an object of 8 numbers.
 * This is the order that makes Uint32Array the same to Uint8Array
 * This function contains multiple same procedures but we intentionally
 * do it step by step to improve performance a bit.
 **/
export function byteArrayToHashObject(byteArr) {
  const uint32 = new Uint32Array(byteArr.buffer);

  return {
    h0: uint32[0],
    h1: uint32[1],
    h2: uint32[2],
    h3: uint32[3],
    h4: uint32[4],
    h5: uint32[5],
    h6: uint32[6],
    h7: uint32[7],
  };
}

What makes it so slow is the constructor Uint32Array. It takes 90% of the CPU time of that function. Then setting or reading numbers is very fast.

dapplion avatar Aug 12 '21 19:08 dapplion

@dapplion just to confim, the endianness in ssz serialization is little?

g11tech avatar Aug 20 '21 09:08 g11tech

@dapplion just to confim, the endianness in ssz serialization is little?

Yes it's little endian

dapplion avatar Aug 22 '21 10:08 dapplion

  1. as first step, it seems that dataview based to_bytes is at least 2 times faster than the current to_bytes loop for >32 bit value for unit64 and 3.5 times better than <32 bit value for unit64 however when i plugged it in the speedup was just 2.5%-3.5% (another 2.5%-3.5% can potentially come in the reverse leg from bytes to number so may be 5-7% overall with to bytes optimization).

image

  1. Will continue work on the other side leg (bytes to number) and corroborate the above hypo of potential 7% speedup.
  2. while figuring out the optimization and code flow, established other points for potential optimization regarding the use of dataview but will do further code drilldown before establishing the viability of this scenario.
  3. as discussed before, another potential optimization in making merkelization multicore optimized. will tackle that in separate PR.

g11tech avatar Aug 22 '21 19:08 g11tech

@g11tech Thanks! Keep us updated

dapplion avatar Aug 23 '21 08:08 dapplion

@dapplion got 88% speedup on normal uint tree backed (will need to check on > 2^32 number increment test but should be 30% plus over there), marginal 4% better performance on the new uint64 tree backed. will continue checking this further a bit more.

before opt image

after opt: image

g11tech avatar Aug 29 '21 11:08 g11tech