It appears as if some "total time" measures are completely off
:wave:
In the context of the Uyuni open source project, I am looking to optimize a Python command-line tool (draft PR).
My command line is:
py-spy record --native --output spacewalk-repo-sync-trace.json --format speedscope --function -- /usr/bin/python3 -u /usr/bin/spacewalk-repo-sync --channel sles12-sp5-updates-x86_64 --type yum --non-interactive
The program initially spawns 5 download threads that quit midway through the execution. After downloading is done, there is some processing, which is all I care about.
Strange factoid 1: according to speedscope's "Time Order" view, the flame chart and even the raw output, all threads were somewhat active until the very last moment - I see stack traces from the download threads even close to the end. It's as if they were all interleaved - when in fact they were not (I did check with ps -T).
I patched a particular function called processCapabilities, which only works on the single main thread after all downloading is done. Before the latest patch in the above PR, it looked like one of the biggest offenders:

After the patch, it takes only 3%, down from 18%! Good job!

...except that's not actually the case: the run time is actually longer. According to py-spy's output, it's as if all other function calls magically became longer, more than compensating for the saving in processCapabilities, which makes little sense.
Finally I measured it with an extra block like the following:
import timeit  # timeit.default_timer is time.perf_counter on Python 3

start_time = timeit.default_timer()
...processCapabilities(...)...
elapsed = timeit.default_timer() - start_time
global total   # total is a module-level accumulator, initialized to 0.0
total += elapsed
log(total)
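For completeness, here is a self-contained sketch of that kind of measurement as a reusable decorator; the names timed_total and work are made up for illustration and are not part of the actual patch:

```python
import functools
import timeit

total = 0.0  # accumulated time spent in the decorated function, across all calls


def timed_total(func):
    """Hypothetical decorator: add each call's duration to the global total."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global total
        start_time = timeit.default_timer()
        try:
            return func(*args, **kwargs)
        finally:
            total += timeit.default_timer() - start_time
    return wrapper


@timed_total
def work():
    # stand-in for processCapabilities(...)
    return sum(range(1_000_000))


work()
print(f"total so far: {total:.6f}s")
```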
Strange factoid 2: the total duration actually got much worse with the patch instead of much better: total used to be 25% of the program's run time (wall-clock time), and after my patch it's 41%.
Strange factoid 3: py-spy does indeed sample ~100 times per second, according to the number reported on standard output when the program terminates. But the number of samples in the bottom-most element of the flame graph/flame chart/speedscope view is much lower: for example, in a 657 s run I got 65733 samples, but the bottom-most element in the flame graph has only 37229. That does not change if I use --threads and look at the `all` element. Where have all the other samples gone? Could this be the culprit?
Million-dollar question: how can I even get started understanding what is going on? Are those all real bugs?
Can you try with the --idle option?
py-spy samples 100x a second - but by default doesn't write out frames where the CPU isn't active. This is causing issue #3 that you report, and I think what's happening here is that the speedscope visualization is getting messed up by idle frames.
This might be a bit involved to fix unfortunately =(
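Concretely, that would be your original command with `--idle` added and everything else unchanged:

```
py-spy record --native --idle --output spacewalk-repo-sync-trace.json --format speedscope --function -- /usr/bin/python3 -u /usr/bin/spacewalk-repo-sync --channel sles12-sp5-updates-x86_64 --type yum --non-interactive
```

With idle frames included, the sample count at the root of the flame graph should end up much closer to the total py-spy reports (~100 samples/s × 657 s ≈ 65700 in your run).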
@benfred, I checked this again and:
- Strange factoid 1 was really a PEBKAC. Looking at the time scales it is clear that threads in SpeedScope flame charts are always re-scaled to fit the whole window width, so you can't really superimpose different threads, as each has its own scale. Sorry for the noise.
- Strange factoid 2 could be related to the fact that I had an issue with NTP synchronization on the machine I use for tests. Specifically, the synchronization could happen while profiling was running - could this be why I saw unreasonable numbers? After making sure NTP is forcibly synchronized immediately before profiling, this issue seems to have disappeared (a sketch of how I could check for such a clock step in future runs is below).
- Strange factoid 3 is indeed explained by `--idle`.
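For the record, here is a minimal sketch of the check mentioned above: time.monotonic() is not affected by system clock updates, while time.time() follows them, so a sudden change in their difference during a run means the wall clock was stepped (e.g. by NTP) while profiling. The 0.5 s threshold is arbitrary, just for illustration.

```python
import time


def clock_offset():
    """Difference between the wall clock and the monotonic clock."""
    return time.time() - time.monotonic()


baseline = clock_offset()
# ... run the workload being profiled ...
drift = clock_offset() - baseline
if abs(drift) > 0.5:
    print(f"wall clock was adjusted by ~{drift:.3f}s during the run")
```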
If you confirm NTP sync as a possible culprit for number 2 above, this issue can be closed.
Thanks for your patience!