tracy icon indicating copy to clipboard operation
tracy copied to clipboard

Attribute sample entry call stacks to "parent" samples as an option.

Open mcourteaux opened this issue 2 years ago • 6 comments

As I keep using Tracy for analyzing my software, I found out that results can in certain cases be still misleading in a subtle way. Right now, it is obvious that calling non-inlined functions will not contribute sample counts to the call instruction. That was clear, but would be useful as another toggle, in my opinion. I think Tracy again has the necessary information to display this as such: I think there are full call stacks available with every sample?

However, the case where this gets misleading is with memory IO. I was optimizing memory accesses using the SSE/AVX streaming instructions (_mm256_stream_ps). Using this seemed to improve results in Tracy a lot, however... When analyzing the same with perf in Linux, you get to see a flame chart. Turns out that actually the Linux kernel is getting in between in the middle of those streaming instructions to handle page faults. As far as I could tell, Tracy doesn't display this as a cost of the instructions, just like it doesn't display the cost of the function you call with a call instruction.

Here is a demonstration of my findings: image After correctly filtering: perf's flamechart clearly shows the kernel interrupts as part of the stream_store. About 80% of the time (or 19490 samples) is spent on phase_1 goes to this stream_store.

Now let's have a look at Tracy: image Carefully selecting all instructions (and one after) those that are involved with the movntps together only yield 58.97% of samples. Clearly we are missing a few.

image Looking at these sample statistics, the kernel interrupt is right there in red, so that's good! Looking at the Entry call stacks of these samples indeed correctly associate it with those streaming instructions. Changing the cost in the overview to number of samples, instead of %Time, we get to see this: image Together with more of these kernel functions as reported by perf: image

About 10321 samples gathered in the kernel. Looking at the entry call stacks again: image About 10032 come from the store instruction.

Adding all those "lost" samples from kernel interrupts to those relevant instructions probably gets us to the 80% as reported by perf.


Overall, being able to see the cost of the kernel interrupts there would be useful, exactly like it would be useful for seeing the cost of the call instruction. Both seem like the same mechanism: looking at the entry call stack. Just like "propagate inlines", you could have a "propagate call stacks". Slightly different in nature, as a callstack is runtime information from the stack, and as the "propagate inlines" feature uses debug information to trace back the hierarchy of inlines.

mcourteaux avatar Nov 23 '22 12:11 mcourteaux

I think you are asking for the "child calls" option.

obraz

wolfpld avatar Nov 23 '22 12:11 wolfpld

Hmm, let's look at the call simd::log(): image Now with Child calls enabled: image No extra samples were counted, yet: about 2680 samples were counted in this function according to the statistics window: image The call instruction I showed is the only call site in my program in the selected time range (I filtered using this option: image )

mcourteaux avatar Nov 23 '22 12:11 mcourteaux

Notice in my screenshot of last comment, that the other call indeed has a tremendous impact on the time attributed to that call (or the one instruction after it, as it's the return address). Something's going wrong here in a subtle way...

mcourteaux avatar Nov 23 '22 12:11 mcourteaux

Can you provide the trace you are referencing above?

wolfpld avatar Nov 24 '22 17:11 wolfpld

attributing_samples_child_calls.tracy.zip

(Funny: I used the most aggressive saving method, and then zipped it: zip took an additional 11% away from the file size. :exploding_head: (Had to zip to be able to upload, because Github was complaining about extensions.))

mcourteaux avatar Nov 24 '22 21:11 mcourteaux

I used the most aggressive saving method, and then zipped it: zip took an additional 11% away from the file size. 🤯

That's still LZ4. Using zstd instead, with max compression setting, produces a 16.9 MB trace file that is not compressible further.

wolfpld avatar Nov 25 '22 00:11 wolfpld