hep icon indicating copy to clipboard operation
hep copied to clipboard

sliceop/f64s: improve further algorithm in Take

Open sbinet opened this issue 5 years ago • 2 comments

as mentionned in https://github.com/go-hep/hep/pull/674#issuecomment-618947134, performances for certain workload sizes could be further improved.

investigate dip of performances for other workload sizes (128<n<1024*1024)

sbinet avatar Apr 24 '20 12:04 sbinet

Hi, I have two things to share

  1. I didn't see a performance hit with the new changes locally (my laptop stats: cpu: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz)
  2. I have a couple of changes which improve the performance for n>=128 (on my laptop)

To address the first point, I ran the tests 5 times each and here are the results (mid.log are the changes made in #674)

benchstat old.log mid.log 
name                old time/op    new time/op    delta
Take/Len=2-4          5.45ns ± 1%    5.45ns ± 1%    ~     (p=1.000 n=5+5)
Take/Len=4-4          5.85ns ± 0%    5.84ns ± 0%    ~     (p=0.198 n=5+5)
Take/Len=8-4          6.27ns ± 2%    6.29ns ± 1%    ~     (p=0.381 n=5+5)
Take/Len=128-4        60.3ns ± 0%    60.9ns ± 0%  +0.88%  (p=0.008 n=5+5)
Take/Len=1024-4        465ns ± 0%     468ns ± 0%  +0.48%  (p=0.016 n=5+5)
Take/Len=1048576-4     735µs ± 1%     750µs ± 1%  +2.10%  (p=0.008 n=5+5)

I did not see any slowdown for the n=[128,1024] case nor speedup for n=1048576 case.

Next, I found a couple areas for improvement:

  • error checking (i.e. panic) in the for loop can be more efficient if we check that v0 and v1 are correct before checking what is wrong
  • save index variables to access indices array less frequently

The resulting change are below, followed by the new timing results when compare to the old code (before #674)

diff --git a/sliceop/f64s/f64s.go b/sliceop/f64s/f64s.go
index 347e2f36..a5ced4ba 100644
--- a/sliceop/f64s/f64s.go
+++ b/sliceop/f64s/f64s.go
@@ -97,18 +97,21 @@ func Take(dst, src []float64, indices []int) []float64 {
                return dst
        }
 
-       dst[0] = src[indices[0]]
-       for i := 1; i < len(indices); i++ {
-               v0 := indices[i-1]
-               v1 := indices[i]
+  n := len(indices) - 1
+  v0 := indices[0]
+  for i:=0; i<n; i++ {
+    v1 := indices[i+1]
                switch {
+               case v0 < v1:
                case v0 == v1:
                        panic(errDuplicateIndices)
                case v0 > v1:
                        panic(errSortedIndices)
                }
-               dst[i] = src[v1]
+               dst[i] = src[v0]
+               v0 = v1
        }
+       dst[n] = src[v0]
 
        return dst
 }
benchstat old.log new.log
name                old time/op    new time/op    delta
Take/Len=2-4          5.45ns ± 1%    5.51ns ± 3%     ~     (p=0.333 n=5+5)
Take/Len=4-4          5.85ns ± 0%    5.85ns ± 0%     ~     (p=0.889 n=5+5)
Take/Len=8-4          6.27ns ± 2%    6.23ns ± 1%     ~     (p=0.056 n=5+5)
Take/Len=128-4        60.3ns ± 0%    51.9ns ± 0%  -13.88%  (p=0.008 n=5+5)
Take/Len=1024-4        465ns ± 0%     388ns ± 0%  -16.68%  (p=0.008 n=5+5)
Take/Len=1048576-4     735µs ± 1%     694µs ± 0%   -5.60%  (p=0.008 n=5+5)

jucaleb4 avatar Jul 05 '21 23:07 jucaleb4

Nice. Would you like to send a pull request with your changes?

sbinet avatar Jul 06 '21 06:07 sbinet