Initial pass at "relaxed" multiply and add operations.

Open stephentyrone opened this issue 2 years ago • 7 comments

This commit adds the following implementation hooks to the AlgebraicField protocol:

static func _relaxedAdd(_:Self, _:Self) -> Self
static func _relaxedMul(_:Self, _:Self) -> Self

These are equivalent to + and *, but have "relaxed semantics": they license the compiler to reassociate them and to form FMA nodes, both of which are significant optimizations that can easily make many common loops 8-10x faster. These transformations perturb results slightly, so they should not be enabled without care, but the results with the relaxed operations are, for most purposes, "just as good as" (and often better than) what strict operations produce. The main thing to beware of is that results are no longer portable: different compiler versions, targets, and optimization flags can produce different results.
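
To make "perturb results slightly" concrete, here is a minimal sketch using only the Swift standard library (it does not use the new hooks). It shows by hand the two effects that relaxed semantics allow the compiler to introduce: a different grouping of additions, and a fused multiply-add with a single rounding. The variable names are illustrative only.

```swift
// Reassociation: the same three Floats summed with two different groupings.
let a: Float = 1e8
let b: Float = -1e8
let c: Float = 1

let strict = (a + b) + c      // (1e8 - 1e8) + 1 == 1
let regrouped = a + (b + c)   // b + c rounds back to -1e8, so the sum is 0

// FMA formation: x*x rounded once (fused) versus rounded twice (unfused).
let x: Float = 1 + Float.ulpOfOne
let rounded = x * x                          // product rounded to Float
let unfused = x * x - rounded                // strict arithmetic: exactly 0
let fused = (-rounded).addingProduct(x, x)   // x*x - rounded with one rounding

print(strict, regrouped, unfused, fused)
```

The fused result recovers the low-order bits of the product (here ulpOfOne squared) that the strict sequence discards, which is one reason relaxed results are often at least as accurate as strict ones.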

These are then exposed under the Relaxed namespace as:

Relaxed.sum(a, b)
Relaxed.product(a, b)
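
The relationship between the protocol hooks and the namespace can be sketched as follows. This is a self-contained illustration, not the library's code: RelaxedArithmetic and RelaxedSketch are hypothetical stand-ins for AlgebraicField and Relaxed, and the hooks here simply fall back to the strict operators, which is always a valid (if slow) implementation because relaxed semantics permit reordering without requiring it.

```swift
// Hypothetical stand-in for the hooks added to AlgebraicField.
protocol RelaxedArithmetic {
  static func _relaxedAdd(_ a: Self, _ b: Self) -> Self
  static func _relaxedMul(_ a: Self, _ b: Self) -> Self
}

// Assumption for this sketch: strict fallbacks. A real implementation would
// emit operations that the optimizer is allowed to reassociate and fuse.
extension Float: RelaxedArithmetic {
  static func _relaxedAdd(_ a: Float, _ b: Float) -> Float { a + b }
  static func _relaxedMul(_ a: Float, _ b: Float) -> Float { a * b }
}

// Hypothetical stand-in for the Relaxed namespace: thin forwarding functions.
enum RelaxedSketch {
  static func sum<T: RelaxedArithmetic>(_ a: T, _ b: T) -> T {
    T._relaxedAdd(a, b)
  }
  static func product<T: RelaxedArithmetic>(_ a: T, _ b: T) -> T {
    T._relaxedMul(a, b)
  }
}

let values: [Float] = [1, 2, 3, 4]
let total: Float = values.reduce(0, RelaxedSketch.sum)
let scaled: Float = RelaxedSketch.product(total, 0.5)
print(total, scaled)
```

Keeping the hooks underscored on the protocol and exposing them only through a namespace means callers opt in explicitly at each call site, rather than changing the meaning of + and *.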

stephentyrone avatar Nov 18 '21 02:11 stephentyrone

@swift-ci test

stephentyrone avatar Nov 18 '21 02:11 stephentyrone

@swift-ci test

stephentyrone avatar Nov 18 '21 02:11 stephentyrone

Hrm, why are we using a Swift-5.3.3 Linux toolchain for testing instead of something more recent? Still, good to know--if unfortunate--that reassociate(on) is not supported there. I'll have to add a workaround and a note for that.

stephentyrone avatar Nov 18 '21 02:11 stephentyrone

@swift-ci test

stephentyrone avatar Nov 18 '21 16:11 stephentyrone

@swift-ci test

stephentyrone avatar Nov 19 '21 01:11 stephentyrone

@swift-ci test

stephentyrone avatar Dec 01 '21 13:12 stephentyrone

@swift-ci test

stephentyrone avatar Apr 06 '22 14:04 stephentyrone

@swift-ci test

stephentyrone avatar Apr 26 '23 13:04 stephentyrone

Some quick perf numbers from my M1 laptop:

repeatedly summing 1024 Floats

time using reduce(0, +): 0.091 sec
time using reduce(0, Relaxed.sum): 0.009 sec
time using vDSP.sum from Accelerate: 0.004 sec

repeated dot-product of 1024 Floats

time using reduce(0) { $0 + $1*$1 }: 0.085 sec
time using reduce(0) { Relaxed.multiplyAdd($1, $1, $0) }: 0.011 sec
time using vDSP.sumOfSquares from Accelerate: 0.005 sec

For "typical" reduction workloads like those above, we see about a 10x speedup over the strict operators, and we're within about 2x of hand-written SIMD.
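
As a rough illustration of where the speedup comes from, the sketch below writes out by hand one summation order that reassociation would permit. It uses only the standard library; strictSum and fourLaneSum are hypothetical names, and the four-lane layout is an assumption chosen for illustration, not a description of what the compiler actually generates.

```swift
// Strict left-to-right sum: every addition depends on the previous one,
// so the loop cannot be vectorized without changing the result.
func strictSum(_ x: [Float]) -> Float {
  x.reduce(0, +)
}

// Hand-reassociated sum: four independent accumulators, combined at the end.
// Relaxed semantics let the compiler choose an ordering like this on its own.
func fourLaneSum(_ x: [Float]) -> Float {
  var acc = SIMD4<Float>(repeating: 0)
  var i = 0
  while i + 4 <= x.count {
    acc += SIMD4(x[i], x[i + 1], x[i + 2], x[i + 3])
    i += 4
  }
  // Handle any leftover elements when the count is not a multiple of 4.
  var tail: Float = 0
  while i < x.count {
    tail += x[i]
    i += 1
  }
  return acc.sum() + tail
}

let data = (1...1024).map { Float($0) }
print(strictSum(data), fourLaneSum(data))
```

For inputs like these small integers, both orderings are exact and agree; for general data they can differ in the last few bits, which is the portability caveat that comes with the relaxed operations.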

stephentyrone avatar Apr 26 '23 13:04 stephentyrone