Attribute sample entry call stacks to "parent" samples as an option.
As I keep using Tracy to analyze my software, I have found that the results can still be misleading in certain subtle cases. Right now, it is obvious that calling non-inlined functions does not contribute sample counts to the `call` instruction. That was clear, but it would be useful to have this as another toggle, in my opinion. I think Tracy already has the necessary information to display it as such: there should be full call stacks available with every sample?
However, the case where this gets misleading is memory I/O. I was optimizing memory accesses using the SSE/AVX streaming instructions (`_mm256_stream_ps`). Using them seemed to improve the results in Tracy a lot, however... When analyzing the same code with `perf` on Linux, you get to see a flame chart. It turns out that the Linux kernel actually steps in between those streaming instructions to handle page faults. As far as I can tell, Tracy doesn't display this as a cost of the instructions, just like it doesn't display the cost of the function invoked by a `call` instruction.
Here is a demonstration of my findings:
After correctly filtering:
`perf`'s flame chart clearly shows the kernel interrupts as part of the `stream_store`. About 80% of the time spent in `phase_1` (19490 samples) goes to this `stream_store`.
Now let's have a look at Tracy:
Carefully selecting all the instructions involved with the `movntps` (and the one after), together they yield only 58.97% of the samples. Clearly we are missing a few.
Looking at these sample statistics, the kernel interrupt is right there in red, so that's good! The entry call stacks of these samples indeed correctly associate it with those streaming instructions. Changing the cost in the overview to number of samples instead of %Time, we get to see this:
Together with more of these kernel functions as reported by `perf`:
About 10321 samples gathered in the kernel. Looking at the entry call stacks again:
About 10032 come from the store instruction.
Adding all those "lost" samples from kernel interrupts to the relevant instructions probably gets us to the 80% reported by `perf`.
Overall, being able to see the cost of the kernel interrupts there would be useful, exactly like it would be useful to see the cost of the `call` instruction. Both seem like the same mechanism: looking at the entry call stack. Just like "propagate inlines", you could have a "propagate call stacks" option. It is slightly different in nature, as a call stack is runtime information from the stack, whereas the "propagate inlines" feature uses debug information to trace back the hierarchy of inlines.
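Purely for illustration, here is a hypothetical sketch of what such a "propagate call stacks" pass could do; the `Sample` layout, the histogram, and all names are made up and are not Tracy's actual data model:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical data model: a sample carries its sampled PC plus the
 * entry call stack, innermost frame first. */
typedef struct {
    const uint64_t *stack;  /* stack[0] = sampled PC, stack[1] = caller, ... */
    size_t depth;
} Sample;

/* Tiny credit sink for the demonstration: a histogram of hits per address. */
#define MAX_ADDRS 16
typedef struct {
    uint64_t addr[MAX_ADDRS];
    unsigned hits[MAX_ADDRS];
    size_t n;
} Hist;

static void credit(Hist *h, uint64_t addr) {
    for (size_t i = 0; i < h->n; i++)
        if (h->addr[i] == addr) { h->hits[i]++; return; }
    if (h->n < MAX_ADDRS) {
        h->addr[h->n] = addr;
        h->hits[h->n++] = 1;
    }
}

/* The proposed pass: each sample is credited to its own PC as today, and
 * additionally to the parent frame's address, so a kernel page-fault
 * sample also shows up at the triggering store instruction, and a callee
 * sample also shows up at the call instruction. */
static void propagate(const Sample *samples, size_t n, Hist *h) {
    for (size_t i = 0; i < n; i++) {
        credit(h, samples[i].stack[0]);      /* normal attribution */
        if (samples[i].depth > 1)
            credit(h, samples[i].stack[1]);  /* parent call site */
    }
}
```

With this, a sample taken inside the kernel's page-fault handler (whose entry call stack leads back to the store) would be counted at the store as well, which is what `perf`'s flame chart effectively shows.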
I think you are asking for the "child calls" option.
Hmm, let's look at the call to `simd::log()`:
Now with Child calls enabled:
No extra samples were counted, yet about 2680 samples were counted in this function according to the statistics window:
The `call` instruction I showed is the only call site in my program within the selected time range (I filtered using this option: )
Notice in my screenshot from the last comment that the other call indeed has a tremendous impact on the time attributed to that call (or to the instruction right after it, as that is the return address). Something is going wrong here in a subtle way...
Can you provide the trace you are referencing above?
attributing_samples_child_calls.tracy.zip
(Funny: I used the most aggressive saving method, and then zipped it: zip took an additional 11% off the file size. 🤯 I had to zip it to be able to upload, because GitHub was complaining about the extension.)
> I used the most aggressive saving method, and then zipped it: zip took an additional 11% away from the file size. 🤯
That's still LZ4. Using zstd instead, with max compression setting, produces a 16.9 MB trace file that is not compressible further.