hep sliceop/f64s: improve further algorithm in Take

as mentionned in https://github.com/go-hep/hep/pull/674#issuecomment-618947134, performances for certain workload sizes could be further improved.

investigate dip of performances for other workload sizes (128<n<1024*1024)

Apr 24 '20 12:04 sbinet

Hi, I have two things to share

I didn't see a performance hit with the new changes locally (my laptop stats: cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz)
I have a couple of changes which improve the performance for n>=128 (on my laptop)

To address the first point, I ran the tests 5 times each and here are the results (mid.log are the changes made in #674)

benchstat old.log mid.log 
name                old time/op    new time/op    delta
Take/Len=2-4          5.45ns ± 1%    5.45ns ± 1%    ~     (p=1.000 n=5+5)
Take/Len=4-4          5.85ns ± 0%    5.84ns ± 0%    ~     (p=0.198 n=5+5)
Take/Len=8-4          6.27ns ± 2%    6.29ns ± 1%    ~     (p=0.381 n=5+5)
Take/Len=128-4        60.3ns ± 0%    60.9ns ± 0%  +0.88%  (p=0.008 n=5+5)
Take/Len=1024-4        465ns ± 0%     468ns ± 0%  +0.48%  (p=0.016 n=5+5)
Take/Len=1048576-4     735µs ± 1%     750µs ± 1%  +2.10%  (p=0.008 n=5+5)

I did not see any slowdown for the n=[128,1024] case nor speedup for n=1048576 case.

Next, I found a couple areas for improvement:

error checking (i.e. panic) in the for loop can be more efficient if we check that v0 and v1 are correct before checking what is wrong
save index variables to access indices array less frequently

The resulting change are below, followed by the new timing results when compare to the old code (before #674)

diff --git a/sliceop/f64s/f64s.go b/sliceop/f64s/f64s.go
index 347e2f36..a5ced4ba 100644
--- a/sliceop/f64s/f64s.go
+++ b/sliceop/f64s/f64s.go
@@ -97,18 +97,21 @@ func Take(dst, src []float64, indices []int) []float64 {
                return dst
        }
 
-       dst[0] = src[indices[0]]
-       for i := 1; i < len(indices); i++ {
-               v0 := indices[i-1]
-               v1 := indices[i]
+  n := len(indices) - 1
+  v0 := indices[0]
+  for i:=0; i<n; i++ {
+    v1 := indices[i+1]
                switch {
+               case v0 < v1:
                case v0 == v1:
                        panic(errDuplicateIndices)
                case v0 > v1:
                        panic(errSortedIndices)
                }
-               dst[i] = src[v1]
+               dst[i] = src[v0]
+               v0 = v1
        }
+       dst[n] = src[v0]
 
        return dst
 }

benchstat old.log new.log
name                old time/op    new time/op    delta
Take/Len=2-4          5.45ns ± 1%    5.51ns ± 3%     ~     (p=0.333 n=5+5)
Take/Len=4-4          5.85ns ± 0%    5.85ns ± 0%     ~     (p=0.889 n=5+5)
Take/Len=8-4          6.27ns ± 2%    6.23ns ± 1%     ~     (p=0.056 n=5+5)
Take/Len=128-4        60.3ns ± 0%    51.9ns ± 0%  -13.88%  (p=0.008 n=5+5)
Take/Len=1024-4        465ns ± 0%     388ns ± 0%  -16.68%  (p=0.008 n=5+5)
Take/Len=1048576-4     735µs ± 1%     694µs ± 0%   -5.60%  (p=0.008 n=5+5)

Jul 05 '21 23:07 jucaleb4

Nice. Would you like to send a pull request with your changes?

Jul 06 '21 06:07 sbinet

hep hep copied to clipboard

sliceop/f64s: improve further algorithm in Take

hep
hep copied to clipboard