num-bigint icon indicating copy to clipboard operation
num-bigint copied to clipboard

Use SmallVec to optimize for small integer sizes.

Open lemmih opened this issue 3 years ago • 5 comments

This is a proof of concept. I'd like to discuss possible designs and pathways to getting something like this merged.

Motivation

Using BigInt with mostly small (<= word-sized) integers is prohibitively expensive. One numerically intense algorithm from rgeometry is 1000 times slower when using BigRational instead of a fixed-precision number.

Performance

num-bigint performs quite well for large input and is competitive with the GMP. Detailed benchmarks can be found here: https://github.com/Lemmih/num-criterion

So why do small integers perform so poorly? Mostly because allocating new memory is significantly more expensive than doing a single addition or multiplication. There are also a few cases where we can use specialized algorithms. For example, there's a particularly fast gcd algorithm that only works on word-sized integers.

SmallVec with two inlined BigDigits

BigUint is a Vec<BigDigit>. As such, it has a size of 3 words in addition to the actual digits. Switching to a SmallVec<[BigDigit; 2]> will not increase the size of a BigUint but allows for two BigDigits to be inlined and thus not require a heap allocation.

Summary of benchmarks:

I've run benchmarks on an M1 Apple MacBook Air and a 3950X AMD desktop. The results I get from these platforms are quite different and I'd love it if other people could help out by running the benchmark on their machines. For instructions, see: https://github.com/Lemmih/num-criterion

  • GCD: Slightly faster across all bit sizes. This is mainly because I wrote a more efficient version of shr_assign. The previous version interacted poorly with SmallVec.
  • Mul: Significantly faster for bit sizes <=64. Slightly slower for 128-bit integers. No significant difference for larger integers.
  • Div: Significantly faster for bit sizes <=128. No significant difference for larger integers.
  • Add: Significantly faster for bit sizes <=128. No significant difference for larger integers.
  • Clone: Significantly faster for bit sizes <=128. No significant difference for larger integers.

Other optimizations

Specialize for integers that fit in 128 bits.

Ideally, addition, multiplication, and division for integers that fit in 128 bits would be as fast as cloning. Copying the bytes is significantly slower than doing the arithmetic.

  • Signed integer addition/subtraction is quite naive. It is comparing the absolute numbers before subtracting/adding. This has to be improved before specialization will have any noticeable effect. Surely we can get within 2x of GMP.

Use better GCD implementation.

The GCD implementation in num-integer is fairly slow and can be significantly improved by changing a line or two. On my system, this leads to a 10x speed improvement for BigInts when the integer fits in 64 bits.

Other oddities

  • Rug and Ramp are really good at cloning data. How can they clone large integers at more than twice the speed of num-bigint? Aren't we all just calling 'memcpy'?

lemmih avatar Jul 28 '21 01:07 lemmih

This PR is the same as the previous one with the exception of a few optimisations. The optimisations weren't necessary; they just proved that the SmallVec backend could be as efficient as Vec.

lemmih avatar Jul 28 '21 02:07 lemmih

Oh, and I added back the 'shr' optimisation because the regression was particularly egregious.

lemmih avatar Jul 28 '21 02:07 lemmih

@cuviper Thoughts?

lemmih avatar Aug 10 '21 05:08 lemmih

How much of a difference does the union make? I prefer that conceptually because we don't increase the size that way, but in past experiments ISTR it had mixed impact on performance overall.

If the union performance is good, then I think I'd prefer having a feature that toggles between Vec or SmallVec (w/ union), where the latter requires MSRV 1.49, but we'd keep the same struct size either way. I think the global type alias would make toggling easy, as long as we're not using any API that happens to be different between them. What do you think?

cuviper avatar Aug 27 '21 23:08 cuviper

The performance difference between union and no-union is minor and is dwarfed by the performance boost from avoiding heap allocations.

I'll update the PR to include a feature switch for SmallVec.

lemmih avatar Aug 28 '21 07:08 lemmih