fgprof icon indicating copy to clipboard operation
fgprof copied to clipboard

A small performance optimization

Open sumerc opened this issue 4 years ago • 2 comments

Hey Felix,

I was playing with runtime package/profilers a bit for learning and while trying out different things, I happen to see following code is used to retrieve all Goroutines:

for {
	n, ok := runtime.GoroutineProfile(p.stacks)
	if !ok {
		p.stacks = make([]runtime.StackRecord, int(float64(n)*1.1))
	} else {
		return p.stacks[0:n]
	}
}

However it turns out following is a bit faster as runtime.NumGoroutine() just returns len(gcount()) whereas GoroutineProfile seems to do bunch of other things.

p.stacks = make([]runtime.StackRecord, int(float64(runtime.NumGoroutine())*1.1))
for {
   ...
}

I have benchmarked following function:

func GoroutineProfile(useNumG bool) {
	var stacks []runtime.StackRecord

	if useNumG {
		nGoroutines := runtime.NumGoroutine()
		stacks = make([]runtime.StackRecord, int(float64(nGoroutines)*1.1))
	}

	for {
		n, ok := runtime.GoroutineProfile(stacks)
		if !ok {
                        // this can happen regardless of our approach
			stacks = make([]runtime.StackRecord, int(float64(n)*1.1))
		} else {
			break
		}
	}
}

And the results are:

go test -bench=. -benchmem                                                                                                   ~/Desktop/p/go-tryouts  
goos: linux
goarch: amd64
BenchmarkGoroutineProfile/Using_runtime.NumGoroutine()-12                 111543              9944 ns/op             768 B/op          1 allocs/op
BenchmarkGoroutineProfile/Not_using_runtime.NumGoroutine()-12              66451             16140 ns/op             768 B/op          1 allocs/op

While not too significant, it is not bad for a single line of change.

Would you be interested for a PR for this? Is there any other problem with the approach?

sumerc avatar Dec 09 '20 15:12 sumerc

Hey @sumerc thank you so much for taking an interest in this and reaching out. I love geeking out about this stuff : ).

As far as your change is concerned, I'd love to see a PR for it. I can totally see how it's faster for the function you've shown, but the code inside of fgprof is slightly different. The profiler struct is written in a way that the costs of discovering the ideal size of the stacks slice is amortized by only doing it once in the beginning. So your optimization would only help with the very first profiler.GoroutineProfile() call.

Anyway, if it's a small change I'd still be happy to merge it. And I'm also happy to just discuss ideas on how to make things faster in general!

felixge avatar Dec 09 '20 15:12 felixge

Anyway, if it's a small change I'd still be happy to merge it. And I'm also happy to just discuss ideas on how to make things faster in general!

Me too! Thanks for such inclusive/positive comment :)

The profiler struct is written in a way that the costs of discovering the ideal size of the stacks slice is amortized by only doing it once in the beginning

I understood. p.stacks is only initialized once. Well, I also confirmed you are right. The change did not change(too much) the output go test -bench=BenchmarkProfilerGoroutines but in anycase, I will submit the PR. Up to you to accept/reject :)

Thanks again!

sumerc avatar Dec 09 '20 15:12 sumerc

Closing this since https://github.com/felixge/fgprof/pull/11 was closed.

felixge avatar Aug 27 '22 12:08 felixge