fix(profilecli): disable `aggregate-callees` in go-pgo query by default
In the default configuration related to profilecli query go-pgo, the exported PGO file does not optimize or improve performance. Using --no-aggregate-callees results in a performance boost, but the prompt for the aggregate-callees parameter is not user-friendly and may mislead users into thinking they should use --aggregate-callees=false. Therefore, I suggest changing the default value of aggregate-callees to false and adding relevant prompts.
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
Hi @JansonLv, thanks for the contribution!
Indeed, I believe a clarification could improve UX. The --no-X semantics is enforced with the CLI framework we use, and I think I should have added a clarification regarding --no-aggregate-callees.
However, I have a couple of questions:
In the default configuration related to profilecli query go-pgo, the exported PGO file does not optimize or improve performance. Using --no-aggregate-callees results in a performance boost
I'd like to learn more about the case:
- Which Go version you're using
- Can we get profiles exported with and without
aggregate-calleesoption? I understand that profiles may include sensitive data; but dropping strings from the pprof file should help (I can provide you a script that does this)
For the context: the go-pgo optimization is inspired by the discussion and the snippet by Michael Pratt (Go team). I suspect that in the most recent Go version, it may be harmful. Another possibility is that there is a bug in how the callees are being aggregated.
@kolesnikovae Hi, thank you for your reply. I am using version go1.22.7.
Thank you @JansonLv! I'll double check that our go-pgo works as expected on go tip. In the meantime, you can find the script in the cmd/trimstrings branch – the simplest way to run it is to clone pyroscope repo and run
go install ./cmd/trimstrings
@kolesnikovae thanks,Please check if this is the file you are looking for. with_agg_profile_clean.pb.gz without_agg_profile_clean.pb.gz
hi,@kolesnikovae Sorry, that was generated by another project; you can review the information from these two files, and I have modified the trimstrings code, processing only part of the information. with_agg_profile_clean.pb.gz without_agg_profile_clean.pb.gz
Thank you so much for providing the samples!
Profiles look good, however they contain surprisingly few samples ~800 and ~1000 samples correspondingly (16.64s of CPU time in both). Idea of go-pgo query was to reduce the profile size and speed up Go builds – usually, when you merge profiles over a period of time (e.g., a couple of days), they include millions of samples, which becomes a problem.
UPD: Same for the second pair: 756/1223/17.45s. I'd suggest querying a longer period of time.
I'm wondering how you've measured the impact of PGO – is it possible to check the compiler logs?
UPD2: correct command:
go build -pgo=auto -gcflags="-m -d=pgodebug=99" . 2>&1 | grep -e "inlining call to" -e "PGO devirtualizing call to" | wc -l
@kolesnikovae I obtained this through stress testing the service and did not use any related Go commands. Thank you for your guidance.
Thank you for sharing this! I'd like to figure out why aggregation affects PGO results, because this is really helpful feature and I wouldn't like disabling it by default.
In the log I see use of CGO and lots of undefined reference errors and that both gcc and ld failed – I'm curious if the build was successful. Could you please build the program with -m -d=pgodebug=99 added to -gcflags? Something like:
go build -pgo=auto -gcflags="-m -d=pgodebug=99" . 2>&1 | grep -e "inlining call to" -e "PGO devirtualizing call to" | wc -l
This will tell us how many optimisations Go compiler made (very roughy), and estimate impact of the PGO (not the app performance).
As for load/stress tests – as far as I understand from the log, it might involve IO (message broker), which would very likely dominate the result. Usually, the expected improvement (reduce in CPU consumption) is within 2-5%. Also, please note that PGO won't help with C code.
hi,@kolesnikovae I'm so sorry to keep you waiting for so long. It took me some time to set up the relevant environment locally.
In the log I see use of CGO and lots of undefined reference errors and that both gcc and ld failed – I'm curious if the build was successful.
There is indeed a problem, which is why I didn't add the required tag for github.com/confluentinc/confluent-kafka-go/v2/kafka."
in the end, the result is still 0, even though I used the one obtained through the /debug/pprof/profile. I also want to figure out why this is happening.
I think we can keep the aggregation for now, as it will take some time to investigate this issue.
Enjoy your weekend
./main.go:6:13: PGO devirtualize considering call cmd.Execute() {"Pkg":"main","Pos":"/mnt/c/Users/bodhi/worker/server/main.go:6:13","Caller":"main.main","Direct":true,"Interface":false,"Weight":0,"Hottest":"server/cmd.Execute","HottestWeight":0,"Devirtualized":"","DevirtualizedWeight":0} hot-callsite-thres-from-CDF=0.02697599136768276 hot-cg before inline in dot format: digraph G { forcelabels=true; "main.main" [color=black, style=solid, label="main.main"]; "server/cmd.Execute" [color=black, style=solid, label="server/cmd.Execute"]; edge [color=black, style=solid]; "main.main" -> "server/cmd.Execute" [label="0.00"]; } ./main.go:5:6: can inline main
Hi @JansonLv, no worries at all – I'm here to help. Thank you for sharing this! Please let me know if I can help you in any way.
In the meantime, I believe that clarifying the CLI help message and documentation (in ./docs) would improve the user experience. If you're willing to contribute, your effort would be very welcome :) Otherwise, I can take care of this
Hi,@kolesnikovae, I'm very honored to contribute to Grafana and improve the user experience. I will make the necessary modifications and commit them as soon as possible. I'm also glad to have your help.
Hi, @kolesnikovae, I have submitted the code, but I am uncertain whether the last sentence ‘Try both options to see which gives better for your PGO’ is suitable. I hope to get your opinion on this.
Wish you a pleasant weekend
hi, @kolesnikovae , I noticed that the hot-callsite-thres-from-CDF flag, which is related to PGO (Profile-Guided Optimization), may vary depending on the profile. So, would the parameters obtained from the original profile differ from those obtained after getting PGO through Pyroscope and then performing PGO compilation? It should be related to keep-locations, but there shouldn't be a connection if it's large enough; theoretically, it shouldn't have anything to do with aggregate-callees either, right?
I think that CDF will be affected in any case. This is what is going on (simplified):
- a PGO profile is a graph
{ node, edge }, where each node represents a call site, and edges (function A calls B) have weights. - PGO finds "hot nodes": nodes connected by the most heavy edges that give 99% (default) of the total weight.
-
hot-callsite-thres-from-CDFis the weight of the edge that made it to go over the threshold.
Thus, the aggregation and trimming will affect the hot-callsite-thres-from-CDF, because they change the graph. Nevertheless, its impact on the PGO results is expected to be negligible. As far as I can see, hot-callsite-thres-from-CDF is only used in debug purposes
@kolesnikovae , Thank you for explanation, I will spend the next two weeks comparing them in practice online to decide whether to continue exploring further.
Thank you for your patience, @JansonLv! I'm very interested in the results of your research, and I would greatly appreciate it if you could share your experiences on how Pyroscope could be enhanced to better support PGO – your insights would be highly valuable!
Thank you for your patience, @kolesnikovae
Based on online practice, it has been demonstrated that using the --no—aggregate-callees parameter is more likely to bring about performance improvements.
We have two web services:
one of these services involves a lot of I/O processing and has high latency. In this case, PGO did not result in any noticeable improvement.
For the other web service, which only has a small number of I/O requests, when we initially deployed it with the --aggregate-callees flag, there was almost no observable improvement. However, during the second deployment with the --no—aggregate-callees flag, the results were significantly better, with CPU load decreasing by about 15%.
This situation aligns with my expectations but seems to be outside of yours. It appears that we need to delve deeper into this issue.
Given this, should we reopen an issue to specifically discuss this problem? As this is not something that can be resolved quickly.
This is a comparison before (--aggregate-callees) and after (--no—aggregate-callees) the deployment:
Hi @JansonLv, of course, please feel free to create a new issue dedicated to the problem. It would be great if we could take a look at the compilation log with pgodebug turned on
Thank you for your patience, @kolesnikovae,I have added two compilation logs, but they are too complex, and I may not be able to provide much insight. See if they can be of any help to you.
https://github.com/grafana/pyroscope/issues/3645