hep
hep copied to clipboard
sliceop/f64s: improve further algorithm in Take
as mentionned in https://github.com/go-hep/hep/pull/674#issuecomment-618947134, performances for certain workload sizes could be further improved.
investigate dip of performances for other workload sizes (128<n<1024*1024)
Hi, I have two things to share
- I didn't see a performance hit with the new changes locally (my laptop stats:
cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz) - I have a couple of changes which improve the performance for
n>=128(on my laptop)
To address the first point, I ran the tests 5 times each and here are the results (mid.log are the changes made in #674)
benchstat old.log mid.log
name old time/op new time/op delta
Take/Len=2-4 5.45ns ± 1% 5.45ns ± 1% ~ (p=1.000 n=5+5)
Take/Len=4-4 5.85ns ± 0% 5.84ns ± 0% ~ (p=0.198 n=5+5)
Take/Len=8-4 6.27ns ± 2% 6.29ns ± 1% ~ (p=0.381 n=5+5)
Take/Len=128-4 60.3ns ± 0% 60.9ns ± 0% +0.88% (p=0.008 n=5+5)
Take/Len=1024-4 465ns ± 0% 468ns ± 0% +0.48% (p=0.016 n=5+5)
Take/Len=1048576-4 735µs ± 1% 750µs ± 1% +2.10% (p=0.008 n=5+5)
I did not see any slowdown for the n=[128,1024] case nor speedup for n=1048576 case.
Next, I found a couple areas for improvement:
- error checking (i.e. panic) in the for loop can be more efficient if we check that
v0andv1are correct before checking what is wrong - save index variables to access
indicesarray less frequently
The resulting change are below, followed by the new timing results when compare to the old code (before #674)
diff --git a/sliceop/f64s/f64s.go b/sliceop/f64s/f64s.go
index 347e2f36..a5ced4ba 100644
--- a/sliceop/f64s/f64s.go
+++ b/sliceop/f64s/f64s.go
@@ -97,18 +97,21 @@ func Take(dst, src []float64, indices []int) []float64 {
return dst
}
- dst[0] = src[indices[0]]
- for i := 1; i < len(indices); i++ {
- v0 := indices[i-1]
- v1 := indices[i]
+ n := len(indices) - 1
+ v0 := indices[0]
+ for i:=0; i<n; i++ {
+ v1 := indices[i+1]
switch {
+ case v0 < v1:
case v0 == v1:
panic(errDuplicateIndices)
case v0 > v1:
panic(errSortedIndices)
}
- dst[i] = src[v1]
+ dst[i] = src[v0]
+ v0 = v1
}
+ dst[n] = src[v0]
return dst
}
benchstat old.log new.log
name old time/op new time/op delta
Take/Len=2-4 5.45ns ± 1% 5.51ns ± 3% ~ (p=0.333 n=5+5)
Take/Len=4-4 5.85ns ± 0% 5.85ns ± 0% ~ (p=0.889 n=5+5)
Take/Len=8-4 6.27ns ± 2% 6.23ns ± 1% ~ (p=0.056 n=5+5)
Take/Len=128-4 60.3ns ± 0% 51.9ns ± 0% -13.88% (p=0.008 n=5+5)
Take/Len=1024-4 465ns ± 0% 388ns ± 0% -16.68% (p=0.008 n=5+5)
Take/Len=1048576-4 735µs ± 1% 694µs ± 0% -5.60% (p=0.008 n=5+5)
Nice. Would you like to send a pull request with your changes?