swift-numerics
swift-numerics copied to clipboard
[BigInt tests][No merge] 🐰 How NOT to write performance tests
Please read the #242 Using tests from “Violet - Python VM written in Swift” before.
=== DO NOT MERGE! Discussion only. ===
🐰 Discussion
Those are the all of the basic operations. In the future we may add some more targeted tests for interesting operations (mul/div), but for now this should be enough.
It may also be a good idea to move all of the performance tests to a separate test target. Currently I am using PERFORMANCE_TEST compilation flag.
You can use Violet as a baseline measure, thought Violet was heavily optimized for Int32 range (see documentation). I could easily write something faster, but this is not my use-case.
String parsing
That said, I have to say that I find the String parsing performance quite bad. From my tests Violet is ~30 times faster.
Violet secret:
Instead of using a single
BigIntand multiplying it byradix, we will group scalars into words-sized chunks. Then we will raise those chunks to appropriate power and add together.For example:
1_2345_6789 = (1 * 10^8) + (2345 * 10^4) + (6789 * 10^0)So, we are doing most of our calculations in fast
Word, and then we switch to slowBigIntfor a few final operations.
Implemented here. Should I create an issue for this?
Mac
I have 2014 rMBP -> mac 11.7 (Big Sur), Xcode 13.2.1, Intel. I assume that 9 years old machines are not exactly a priority for 🍎. Executing all of those tests take ~30 min (serial execution, DEBUG), and I can't be bothered to re-run them properly since it will throttle after a few minutes anyway…
Linux
Normally it would fail to compile:
PerformanceTests.generated.swift:27:23: error: cannot find type 'XCTMetric' in scope
private let metrics: [XCTMetric] = [XCTClockMetric()] // XCTMemoryMetric()?
^~~~~~~~~
PerformanceTests.generated.swift:28:23: error: cannot find 'XCTMeasureOptions' in scope
private let options = XCTMeasureOptions.default
^~~~~~~~~~~~~~~~~
But I wrote my own thing. Results below.
I wrote a test script that runs multiple BigInt implementations:
-
swift_numerics- branch from this PR -
Violet- this branch which is amainbranch with allswift_numericstests from all PRs- optimized for
Int32range. In this tests only0,-1and1fall into this range, so it should not matter.
- optimized for
-
Violet XsProMax- this branch. Violet implementation with following changes:-
no small inlined integer - magnitude is always stored on the heap
-
no restrictions on the size -
isNegativeis stored in-line (and not on the heap like in Violet);countandcapacityare on the heap because I don't want to stray too much fromViolet.struct BigInt { struct Header { var count: UInt32 var capacity: UInt32 } var flags: UInt8 var buffer: ManagedBufferPointer<Header, UInt> }
-
-
attaswift- this branch which is amainbranch with added performance tests
Platform:
- 🐧 Ubuntu 22.04.1 LTS 64-bit
- Intel Pentium(R) CPU G4600 @ 3.60GHz × 4
- Swift 5.7.3 (swift-5.7.3-RELEASE) from Swift.org
- Since
XCTestCase.measuredoes not exist on Linux I wrote my own thing.
Important:
-
numbers up to 6400 bits (
100 * UInt64.bitWidthin magnitude) -
images, because GitHub does not allow colors; text available here (with relative standard deviation).
-
inouttests start with copying a givenBigInt, so they may be slower than normal operation. -
values in parens mean relative performance improvement. For example:
🐧 swift_numerics 🐧 Violet 🐧 attaswift test_string_fromRadix10 0.2386572415 0.011443098625 (20.9x) 0.025662900375 (9.3x) Means that
Violetis 20.9 times faster thanswift_numericsandattaswiftis 9.3 times faster.
string

This test involved String operations on 1203 BigInts.
Parsing from string:
attaswiftis ~10x fasterVioletis 190x faster with radix 10 and 345x faster when radix is a power of 2Violethas an additional specialization for powers of 2- implementation available here
- already discussed in initial post (see above)
To string:
attaswiftis 12.5x faster for radix 10 and 152x faster for radix 16Violetis 21x faster for radix 10 and >2500x faster when radix is power of 2 (up to 3210x (!) faster for radix 16)Violethas an additional specialization for powers of 2.- implementation available here
- in
swift_numericsradix 16 takes more time than radix 10! Normally bigger radix -> faster calculation (because we divide by bigger number). This is not normal. - divide-and-conquer algorithm exists, but none of the implementations has it
equal/compare

This test involved ==/< operations on 1_447_209 BigInt pairs.
swift_numericsis the fastest - it just usesSwift.Arraycompareattaswift/Violet XsProMaxstore sign inline, so very often you can get the result without even looking at the heap.Violetis the slowest, though even a small difference would show here (small difference repeated 1_500_000 times becomes a big difference):- handling
Smi(inlinedInt32) has a certain cost - if we do not fit into
Smithen we have to fetch sign from the heap
- handling
- those operations are extremely fast, we called them ~1.5 million times! In Instruments we see
swift_releaseandswift_retainwith >10% each. I think that all of the implementations have sufficient performance.
unary

This test involved +-~ operations on 100_203 BigInts.
Plus:
- all of the projects use the default implementation from protocol
- if we wrote it by hand then it would be ~1000 slower (tested on
Violet)
- if we wrote it by hand then it would be ~1000 slower (tested on
- it is so fast that the relative standard deviation (not shown here) is basically a random number
- nothing to improve here
Minus:
attaswift/Violet XsProMaxare ~4x faster thanswift_numerics/Violet-
in
attaswift/Violet XsProMaxunary '-' flips the sign and retainsmagnitude-> nomemcpystruct BigInt { struct Header { var count: UInt32 var capacity: UInt32 } // Bool and Array in case of 'attaswift' var flags: UInt8 var buffer: ManagedBufferPointer<Header, UInt> } -
Violetstores both sign and magnitude on the heap -> unary '-' has tomemcpyand flip bit.- I wanted to store the sign in the pointer to get the same behavior as
attaswift, but I already use the tagged pointer to differentiate between small inlinedInt32and heap allocated storage. I didn't want to complicate it too much.
- I wanted to store the sign in the pointer to get the same behavior as
-
swift_numericsuses 2-complement ->memcpyand then a few bit operations.
-
Invert:
- all of the implementations have comparable performance
Violetis 0.653x ininttest, but we are talking about0.00803803swhich is basically nothingViolet XsProMaxis 1.31x inbig, but again fighting over0.00922301sdoes not make a lot of sense- in
Violetit is implemented as~a = negate(a+1). Doing thea+1by specific code is faster than general purposeaddoperation.
add, sub

This test involved +- operations on 162_409 BigInt pairs.
Add:
Violetis 3x faster thanswift_numericsViolet XsProMaxis 3.5x fasterattaswiftis 10% slower
Sub:
Violetis 3.6x faster thanswift_numericsViolet XsProMaxis 4.4x faster thanswift_numericsVioletis written by handswift_numericsusesa-b = a+(-b)which adds an additional operation (negation). It also means that-is slower than+which is weird.
mul, div, mod

This test involved */% operations on 91_809 BigInt pairs.
Mul:
-
Violetis 1.45x faster thanswift_numerics. -
school-book method (used by all of the implementations) is a pure ALU + write dependent, there is nothing 'fancy' to make it better.
- I tried the V8 method but it was 2x slower than the current Violet implementation. It may be because the implement more advanced algorithms, so their requirements/goals for the school-book method are a bit different.
- the code generated by the compiler in
Violetis 'ok'. You could write something better by hand, but that is not the name of the game.
-
attaswiftimplements Karatsuba, but only applies it abovedirectMultiplicationLimit: Int = 1024, so it was never used. Their base implementation 6x slower thanViolet.I am not sure how much time/effort they spend to get the
directMultiplicationLimitvalue. 1024 seems kind of 'round' (because it is a power of 2). In most of the implementations I have seen ~40 words (Word = UInt) which would show in our performance tests.If I was tasked with finding the correct value then I would look for a 'breaking point' where Karatsuba becomes faster than 'normal'. Similar to what the go lang did. Though
attaswift'normal' implementation is quite slow, so that may have affected their measurements (somehow). -
we will probably need more detailed tests once we start implementing more advanced algorithms.
Div/mod:
Violetis 2.4x faster thanswift_numerics.- Knuth 'Algorithm D' with 3
Wordquotient approximation
- Knuth 'Algorithm D' with 3
attaswift4x slower thanViolet.- we will probably need more detailed tests once we start implementing more advanced algorithms.
and, or, xor

This test involved &|^ operations on 162_409 BigInt pairs.
Violetis 2x faster thanswift_numerics; on paperswift_numericsshould have an advantage:Violetstores sign + magnitude - the implementation was taken from GMP - it addsxor,addand overflow check to every word.swift_numericsstores 2 complement which is the best representation for this operation
- small operations (
intrange) are weirdly slow inswift_numerics.Violetis 3.2x faster. Is there some costly preparation/initialization phase? attaswift7x slower thanViolet.
shift

This test involved shifting 20_203 BigInts by 5 values, in total 101_015 shifts per test.
Left:
Violetis 17x faster thanswift_numericsViolet XsProMaxis 20x fasterattaswiftis 10x faster- there is something really weird in
test_shiftLeft_big, theinoutis fast, but the normal version is much slower. Ultra repeatable, occurs every time.
- there is something really weird in
Right:
Violetis 20x faster thanswift_numericsViolet XsProMaxis 23x fasterattaswiftis 12x faster- if I remember correctly
attaswiftrounding mode is different than Swift stdlib
- if I remember correctly
pi
This test was suggested by Xiaodi Wu (@xwu) in BigInt #120, so we may ask them if we have any questions (maybe?). In theory it looks interesting, because it contains +-*/ operations.
In practice count 5000 has following input distribution (more detailed results here):
| Operation | Count | Notes |
|---|---|---|
| add | 39_280 | Inputs of similar size up to 3500 words (3500*UInt64 in magnitude) |
| sub | 5_000 | Mostly inputs of similar size up to 3500 words Some inputs where rhs is a single word (UInt64) |
| mul | 104_086 | lhs goes up to 3522 wordsrhs is always a single word (!!!) |
| div | 22_678 | Inputs of similar size up to 3500 words |
compare (>) |
16_602 | Inputs of similar size up to 3500 words So fast that it does not matter |
Anyway, here are the results:

Results (looking at the 5000 test):
swift_numericsis slower than python (1.33 vs 0.56)Violetis faster than Python (0.51 vs 0.56)Violet XsProMax- slower than Python (0.88 vs 0.56)
- slower than
Violet(0.88 vs 0.51) <- plot twist!
Violet XsProMax being slower than Violet is a surprise because in most of the tests above XsProMax was noticeably faster (5-10%). Though, Violet has an ace up its sleeve: it is really fast for small integers and this test performs 104_086 multiplications where rhs falls into Int32. This was not visible in the tests above because they concentrated on big numbers.
As for the swift_numerics vs attaswift: result is dominated by */. Other operations (namely: +-) are so few/fast that they are not even noticeable.
TLDR; Conclusions
| Operation | Possible improvement | Note |
|---|---|---|
init(String) |
190x for radix 10 345x when radix is a power of 2 |
Discussed in initial post |
toString |
21.5x for radix 10 >2500x when radix is a power of 2 (up to 3210x) |
Radix 16 is slower than radix 10? |
| Equality | - | |
| Compare | - | |
Unary + |
- | Do not write by hand, it is 1000x slower |
Unary - |
4x | Storing sign inline and magnitude on the heap is much faster (obviously, but now we have numbers) |
Unary ~ |
- | |
Binary + |
3x | |
Binary - |
4.4x | Implemented as a-b = a+(-b), writing it by hand may be faster (Violet does this) |
Binary * |
1.45x | |
Binary /% |
2.4x | |
Binary &|^ |
2x | |
| Left shift | 17x | |
| Right shift | 20x | |
| pi | 1.6x 2.6x if we store small values inline |
This test uses a lot of small integers.*/ dominate +-. |
Update 21.2.2023:
- added
equatable,comparableandπtests - improved Violet performance
- updated the posts above with new results
Anyway, from those tests we can see that Violet is the fastest.
Under the hood Violet uses ManagedBufferPointer. If swift_numerics decides to go with this route:
-
be careful when calling
ManagedBufferPointer.isUniqueReference()for COW.Doing this check on every write is expensive and it will NOT be optimized by the compiler. This matters in tight loops (multiplication etc.).
In
VioletI went with a token based system where callingBigIntStorage.guaranteeUniqueBufferReference()gives youUniqueBufferToken. All of themutatingoperations require this token. This way we do the COW check only once. Bonus points for being checked at the compile time (can't callmutatingoperation without a token). -
go into
UnsafeBufferPointerfor final operations.This ensures max performance. This is mostly visible in
mulanddiv. Sometimes allocating a separate buffer for result and then copying it intoBigIntmay be faster.For example the inner logic of division has following signature:
private static func inner( lhsCount: Int, rhsCount: Int, lhsWhichBecomesRemainder lhs: UnsafeMutableBufferPointer<Word>, rhs: UnsafeBufferPointer<Word>, result: UnsafeMutableBufferPointer<Word> ) { dot dot dot }It also means that you will be using
withXXXconstructs very often:lhs.withMutableWordsBuffer(lhsToken) { lhs in remainder.withMutableWordsBuffer(remainderToken) { remainder in Self.inner( lhsCount: lhsCount, rhsCount: rhsCount, lhsWhichBecomesRemainder: remainder, rhs: shiftedRhs, result: lhs ) } }
Obviously you can quite easily write something that will be faster than Violet. Violet is heavily geared towards small integers (Int32) which closes certain optimization opportunities.