Significant performance overhead when timing requires a syscall on Linux
~Hi, I recently saw a performance regression while using pyinstrument. It seems to be connected to the 4.5.1 -> 4.5.3 update but also might be related to Python version (3.9.17 to 3.9.18).~ Update: This is likely related to sympy library use.
I have two GitHub CI runs which use pyinstrument to profile some code that uses sympy.
Good run
https://github.com/J08nY/pyecsca/actions/runs/6039852294/job/16389425413
In the good run, pyinstrument is 4.5.1 and Python is 3.9.17 and the logs look reasonable (look here for example, it takes about 1.2 seconds).
Bad run
https://github.com/J08nY/pyecsca/actions/runs/6247569280/job/16960430761
In the bad run, pyinstrument is 4.5.3 and Python is 3.9.18 and the logs look very long
(look here for example, it now takes 13.6 seconds). I understand the logs are quite verbose since I have the show_all option
enabled, but that does not explain the runtime increase.
Local test
Note that there are some additional calls to sympy added in commits between the two runs. However I believe these are not the cause of the performance regression, since I can observe the regression locally.
Running the code without pyinstrument, I get about a 5-second runtime:
start = datetime.now()
for _ in range(operations):
    one_point = dbl(p256.curve.prime, one_point, **p256.curve.parameters)[0]
end = datetime.now()
print(end - start)
Profiling 5000 256-bit doubling formula (dbl2015rcb) executions...
0:00:05.042373
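(For context, the instrumented number below is roughly what you get by wrapping the same loop in pyinstrument's Profiler API; this is only a sketch reusing the pyecsca objects from the snippet above, not the exact harness I used.)

```python
from datetime import datetime

from pyinstrument import Profiler

# Sketch only: dbl, p256, one_point and operations are the pyecsca objects
# from the snippet above; they are not defined here.
profiler = Profiler()
profiler.start()
start = datetime.now()
for _ in range(operations):
    one_point = dbl(p256.curve.prime, one_point, **p256.curve.parameters)[0]
end = datetime.now()
profiler.stop()
print(end - start)
print(profiler.output_text(unicode=True, color=True, show_all=True))
```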
Running with pyinstrument I get about 90 seconds:
Profiling 5000 256-bit doubling formula (dbl2015rcb) executions...
  _     ._   __/__   _ _  _  _ _/_   Recorded: 17:00:13  Samples:  90372
 /_//_/// /_\ / //_// / //_'/ //     Duration: 90.888    CPU time: 90.885
/   _/                      v4.5.3

Program: pyecsca/test/ec/perf_formula.py

90.888 main  test/ec/perf_formula.py:12
└─ 90.859 DoublingEFDFormula.__call__  pyecsca/ec/formula.py:223
   ├─ 85.403 DoublingEFDFormula.__validate_assumptions  pyecsca/ec/formula.py:154
   │  ├─ 42.555 sympify  sympy/core/sympify.py:101
   │  │     [30 frames hidden]  sympy, <string>, <built-in>, ast
   │  ├─ 20.666 Add.subs  sympy/core/basic.py:838
   │  │     [21 frames hidden]  sympy
   │  ├─ 5.970 Poly.__new__  sympy/polys/polytools.py:162
   │  │     [13 frames hidden]  sympy
   │  ├─ 5.280 FiniteField.__call__  sympy/polys/domains/domain.py:379
   │  │     [6 frames hidden]  sympy
   │  ├─ 5.184 ground_roots  sympy/polys/polytools.py:3718
   │  │     [11 frames hidden]  sympy
   │  ├─ 2.467 resolve  pyecsca/ec/formula.py:197
   │  │  └─ 1.241 resolve  pyecsca/ec/formula.py:197
   │  └─ 1.135 FiniteField.__init__  sympy/polys/domains/finitefield.py:114
   ├─ 3.505 CodeOp.__call__  pyecsca/ec/op.py:146
   │  ├─ 2.558 <module>  None
   │  └─ 1.786 GMPMod.method  pyecsca/ec/mod.py:115
   └─ 1.075 [self]  pyecsca/ec/formula.py
Note that it seems the pyinstrument version does not matter, as I get roughly 90 seconds even
if I use 4.3.0. I am using Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] locally.
I'm a bit confused: do you think the CPython upgrade caused the performance regression, or was it always there?
Another thing that has changed between these runs is the GitHub Actions runner image. But there could also be other factors, like the specific GHA runner that actually executed the job. If you can give me a local recreation of the performance issue, I can debug it, though. If it's 1.5s when running with python and 90s when running with pyinstrument, it would indeed seem to be a bug.
I did some more digging and the culprit seems to be the sympy calls. The situation looks something like this (pyinstrument 4.5.3, sympy 1.12, Python 3.11.5):
| | No sympy | sympy |
|---|---|---|
| No pyinstrument | 5s | 5s |
| pyinstrument | 5s | 90s |
The sympy calls are really not that performance intensive, in the non-instrumented run they take almost no time at all.
> If you can give me a local recreation of the performance issue, I can debug it though. If it's 1.5s when running with python and 90s when running with pyinstrument, it would indeed seem to be a bug.
I will try to create a simple reproducer. Should not be so hard.
All right, here is a reproducer (sympy 1.12, pyinstrument 4.5.3, Python 3.11.5).
With python sympy_instrument.py, it prints This took 0:00:00.636278.
With pyinstrument sympy_instrument.py, it prints This took 0:00:12.355938.
from datetime import datetime

from sympy import sympify, FF, symbols, Poly, Rational


def do_thing():
    # Some elliptic curve crypto stuff that is not important
    field = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
    params = {
        "a": 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFC,
        "b": 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B,
    }
    k = FF(field)
    expr = sympify(f"3*b - b3", evaluate=False)
    for curve_param, value in params.items():
        expr = expr.subs(curve_param, k(value))
    param = str(expr.free_symbols.pop())

    def resolve(expression, k):
        if not expression.args:
            return expression
        args = []
        for arg in expression.args:
            if isinstance(arg, Rational):
                a = arg.p
                b = arg.q
                res = k(a) / k(b)
            else:
                res = resolve(arg, k)
            args.append(res)
        return expression.func(*args)

    expr = resolve(expr, k)
    poly = Poly(expr, symbols(param), domain=k)
    roots = poly.ground_roots()
    for root in roots:
        params[param] = int(root)
        break


if __name__ == "__main__":
    start = datetime.now()
    for _ in range(1000):
        do_thing()
    end = datetime.now()
    print("This took", end - start)
The reproducer above, when run with a varying number of iterations, leads to this runtime graph.
I also ran it with cProfile, which shows a similar slowdown. Then I tried an empty sys.setprofile function, which showed no statistically significant overhead (so it is not on the plot). Finally, I tried a sys.setprofile function that only calls time.time(), and that seems to match the cProfile/pyinstrument slowdown well (a minimal sketch of that experiment follows the summary below).
My conclusion from that is:
- Sympy triggers a ton of profiling events.
- To know when to sample, pyinstrument needs to know the time, so every profiling event needs to call some timer.
- With a large number of very fast function calls (so lots of profiling events), the overhead of getting the time (and then skipping the profiling event) adds up.
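For concreteness, the time.time()-only experiment was essentially the following (a minimal sketch, not the exact script behind the plot):

```python
import sys
import time
from datetime import datetime


def timing_only_hook(frame, event, arg):
    # The minimum a sampling profiler must do on every call/return event:
    # read the clock, then (here, always) decide not to take a sample.
    time.time()


def noop():
    pass


def workload():
    # Lots of very fast Python calls, so lots of profiling events.
    for _ in range(2_000_000):
        noop()


if __name__ == "__main__":
    for hook in (None, timing_only_hook):
        sys.setprofile(hook)
        start = datetime.now()
        workload()
        sys.setprofile(None)
        print("hook" if hook else "no hook", datetime.now() - start)
```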
This is highly platform-dependent, I think. Running 1000 iterations on my Mac, I get
$ python examples/sympy_perf.py
This took 0:00:00.403314
$ pyinstrument examples/sympy_perf.py
This took 0:00:00.845693
That's admittedly a 100% overhead, but nothing like the kind of deltas you're getting here.
What platform are you running on? We have had reports of stuff like this in Docker containers, for example.
Having this reproducer is really useful, thank you for that. I think there are some low-level things that can improve performance, once I know what platform you're on.
I'm assuming you're running this on Linux? I want to test a theory, could you run this code and tell me the output?
#include <stdio.h>
#include <elf.h>
#include <sys/auxv.h>

int main() {
    unsigned long vdso_start = getauxval(AT_SYSINFO_EHDR);
    if (vdso_start) {
        printf("vDSO is active at address: %lx\n", vdso_start);
    } else {
        printf("vDSO does not seem to be active.\n");
    }
    return 0;
}
Save that to a file vdso.c, compile it with clang vdso.c (you might have to apt install clang or similar), then run it with ./a.out.
The vDSO is a mechanism where the kernel makes the timing function gettimeofday really fast by injecting code into userspace to avoid a syscall. But I wonder if it doesn't happen in some cases, leading to these performance issues.
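If compiling C is inconvenient, roughly the same check can be done from Python through glibc's getauxval (a sketch, assuming Linux with glibc; AT_SYSINFO_EHDR is 33 in <elf.h>):

```python
import ctypes

AT_SYSINFO_EHDR = 33  # from <elf.h> on Linux

libc = ctypes.CDLL(None)
libc.getauxval.restype = ctypes.c_ulong
libc.getauxval.argtypes = [ctypes.c_ulong]

addr = libc.getauxval(AT_SYSINFO_EHDR)
if addr:
    print(f"vDSO is active at address: {addr:x}")
else:
    print("vDSO does not seem to be active.")
```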
> What platform are you running on? We have had reports of stuff like this in Docker containers, for example.
> Having this reproducer is really useful, thank you for that. I think there are some low-level things that can improve performance, once I know what platform you're on.
Interesting, I am running ArchLinux (6.5.5 kernel) on x86_64. But also saw this in GitHub CI. I am not sure what other details are important.
Btw. I think a simple reproducer is the following:
from datetime import datetime


def func():
    pass


def do_func_loop():
    for _ in range(6000):
        func()


if __name__ == "__main__":
    start = datetime.now()
    for _ in range(1000):
        do_func_loop()
    end = datetime.now()
    print((end - start).total_seconds())
I get:
$ python repro.py
0.185707
$ pyinstrument repro.py
15.766407
> Save that to a file vdso.c, compile it with clang vdso.c (you might have to apt install clang or similar), then run it with ./a.out.
Did this; it said vDSO is active at address: 7fffff7af000.
I could perhaps try measuring performance of the gettimeofday in C to confirm it is indeed the culprit.
> I could perhaps try measuring performance of the gettimeofday in C to confirm it is indeed the culprit.
Yeah, this would be the next step, I suppose. There are also other ways to get the time on POSIX, like clock_gettime(). I'll see about writing a script that compares them.
Here's a script:
#include <time.h>
#include <stdio.h>
#include <sys/time.h>

typedef enum {
    CT_BASELINE,
    CT_GETTIMEOFDAY,
    CT_MONOTONIC,
    CT_MONOTONIC_RAW,
    CT_CLOCK,
    CT_REALTIME,
    CT_MONOTONIC_COARSE,
} ClockType;

static volatile double dummy_clock = 0.0;

static double
floatclock(ClockType clock_type)
{
    switch (clock_type) {
        case CT_BASELINE:
            return dummy_clock;
        case CT_GETTIMEOFDAY: {
            struct timeval t;
            gettimeofday(&t, (struct timezone *)NULL);
            return (double)t.tv_sec + t.tv_usec*0.000001;
        }
        case CT_MONOTONIC: {
            struct timespec t;
            clock_gettime(CLOCK_MONOTONIC, &t);
            return (double)t.tv_sec + t.tv_nsec*0.000000001;
        }
        case CT_MONOTONIC_RAW: {
            struct timespec t;
            clock_gettime(CLOCK_MONOTONIC_RAW, &t);
            return (double)t.tv_sec + t.tv_nsec*0.000000001;
        }
        case CT_CLOCK:
            return (double)clock() / CLOCKS_PER_SEC;
        case CT_REALTIME: {
            struct timespec t;
            clock_gettime(CLOCK_REALTIME, &t);
            return (double)t.tv_sec + t.tv_nsec*0.000000001;
        }
        case CT_MONOTONIC_COARSE: {
#ifdef CLOCK_MONOTONIC_COARSE
            struct timespec t;
            clock_gettime(CLOCK_MONOTONIC_COARSE, &t);
            return (double)t.tv_sec + t.tv_nsec*0.000000001;
#endif
        }
    }
    return 0.0;
}

static double timing_overhead(ClockType clock_type) {
    int NUM_ITERATIONS = 1000000;
    double counter = 0.0;
    double start = floatclock(CT_MONOTONIC);
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter += floatclock(clock_type);
    }
    double end = floatclock(CT_MONOTONIC);
    if (counter == 0.0) {
        printf("!no-op! ");
    }
    return end - start;
}

int main()
{
    printf("Baseline: ");
    printf("%.3f\n", timing_overhead(CT_BASELINE));
    printf("gettimeofday: ");
    printf("%.3f\n", timing_overhead(CT_GETTIMEOFDAY));
    printf("clock_gettime(CLOCK_MONOTONIC): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC));
    printf("clock_gettime(CLOCK_MONOTONIC_RAW): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC_RAW));
    // clock isn't interesting imo
    // printf("clock(): ");
    // printf("%.3f\n", timing_overhead(CT_CLOCK));
    printf("clock_gettime(CLOCK_REALTIME): ");
    printf("%.3f\n", timing_overhead(CT_REALTIME));
    printf("clock_gettime(CLOCK_MONOTONIC_COARSE): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC_COARSE));
    return 0;
}
This is what it prints on my system (macOS 13.4, arm64 M1 Max).
$ ./a.out
Baseline: !no-op! 0.007
gettimeofday: 0.030
clock_gettime(CLOCK_MONOTONIC): 0.031
clock_gettime(CLOCK_MONOTONIC_RAW): 0.023
clock(): 0.288
clock_gettime(CLOCK_REALTIME): 0.022
clock_gettime(CLOCK_MONOTONIC_COARSE): !no-op! 0.003
In an aarch64 ubuntu Docker container on the same machine:
Baseline: !no-op! 0.006
gettimeofday: 0.054
clock_gettime(CLOCK_MONOTONIC): 0.031
clock_gettime(CLOCK_MONOTONIC_RAW): 0.025
clock(): 0.423
clock_gettime(CLOCK_REALTIME): 0.025
clock_gettime(CLOCK_MONOTONIC_COARSE): 0.005
I've tried the above script on a few machines. My conclusion is that it's not as simple as the existence of the vDSO; it always seems to be there in my testing. But the vDSO will make a syscall if it doesn't support the current clock source, and if it makes a syscall, it's really slow.
You can inspect the clock source like this:
$ # get the clock source
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
arch_sys_counter
$ # get available clock sources on this machine
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
arch_sys_counter
$ # I believe you can set the current clock source using this:
$ # echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
In my testing, running Docker containers on arm64 uses the clock source arch_sys_counter, which is supported by the vDSO, but running Docker containers on x86_64 gives hyperv_clocksource_tsc_page, which isn't supported by the vDSO, making it slow. I suspect something similar is happening to you, @J08nY. The surefire way to test this theory is to run the program above with strace, e.g. strace -c ./a.out (maybe reducing the iterations a bit). You can see which clock sources hit syscalls.
> This is what it prints on my system (macOS 13.4, arm64 M1 Max).
I get the following:
Baseline: !no-op! 0.007
gettimeofday: 1.213
clock_gettime(CLOCK_MONOTONIC): 1.217
clock_gettime(CLOCK_MONOTONIC_RAW): 1.218
clock_gettime(CLOCK_REALTIME): 1.215
clock_gettime(CLOCK_MONOTONIC_COARSE): 0.007
> You can inspect the clock source like this:
I get:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm
> The surefire way to test this theory is to run the program above with strace, e.g. strace -c ./a.out (maybe reducing the iterations a bit). You can see which clock sources hit syscalls.
Tested this with just 1000 iterations, I get:
Baseline: !no-op! 0.000
gettimeofday: 0.049
clock_gettime(CLOCK_MONOTONIC): 0.045
clock_gettime(CLOCK_MONOTONIC_RAW): 0.053
clock_gettime(CLOCK_REALTIME): 0.046
clock_gettime(CLOCK_MONOTONIC_COARSE): 0.000
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
72,20 0,017519 5 3012 clock_gettime
23,34 0,005664 5 1000 gettimeofday
2,20 0,000533 533 1 execve
0,78 0,000189 23 8 mmap
0,40 0,000096 16 6 write
0,21 0,000052 17 3 newfstatat
0,20 0,000049 24 2 pread64
0,16 0,000040 20 2 openat
0,09 0,000023 7 3 mprotect
0,07 0,000018 18 1 munmap
0,07 0,000017 5 3 brk
0,05 0,000013 13 1 1 access
0,05 0,000012 12 1 read
0,05 0,000012 6 2 close
0,05 0,000012 6 2 1 arch_prctl
0,02 0,000004 4 1 getrandom
0,01 0,000003 3 1 set_tid_address
0,01 0,000003 3 1 prlimit64
0,01 0,000003 3 1 rseq
0,01 0,000002 2 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100,00 0,024264 5 4052 2 total
So it seems all except the MONOTONIC_COARSE are going through a syscall.
I found two nice blog posts discussing the situation; apparently it has affected quite a few projects before. This post has some code for benchmarking the different clock sources. With it I get:
clocks.c
clock res (ns) secs nsecs
gettimeofday 1,000 1,696,777,115 253,319,000
CLOCK_REALTIME 1 1,696,777,115 253,328,015
CLOCK_REALTIME_COARSE 3,333,333 1,696,777,115 250,281,928
CLOCK_MONOTONIC 1 257,505 149,471,014
CLOCK_MONOTONIC_RAW 1 257,501 989,271,174
CLOCK_MONOTONIC_COARSE 3,333,333 257,505 146,418,780
ClockBench.cpp
Method samples min max avg median stdev
CLOCK_REALTIME 1023 907. 84578. 1297.50 42742.50 2635.12
CLOCK_REALTIME_COARSE 1023 0. 0. 0. 0. 0.
CLOCK_MONOTONIC 1023 838. 1467. 1093.37 1152.50 216.66
CLOCK_MONOTONIC_RAW 1023 907. 1467. 1204.10 1187.00 237.25
CLOCK_MONOTONIC_COARSE 1023 0. 0. 0. 0. 0.
cpuid+rdtsc 1023 51. 68. 57.88 59.50 8.34
rdtscp 1023 17. 34. 28.33 25.50 8.01
rdtsc 1023 1. 17. 15.24 9.00 4.85
Using CPU frequency = 1.000000
The COARSE clocks seem unreasonably fast, but if I understand their vDSO implementation correctly, reading them amounts to basically just a memory read, so these could be the real numbers. They also have low precision.
I tried changing the gettimeofday call into a clock_gettime call with CLOCK_MONOTONIC_COARSE in pyinstrument, and these are the times I got for the reproducer. I think it may be worthwhile to make pyinstrument more robust to slow clocks by having some mechanism that tests them and picks one that works (perhaps warning the user if the chosen clock cannot satisfy the requested profiling interval).
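To sketch what I mean in Python (this is not pyinstrument internals, and the real check would live in C): measure each candidate clock's per-call overhead and resolution, then pick the cheapest one whose resolution still fits the requested interval.

```python
import time


def pick_clock(interval, candidates=("CLOCK_MONOTONIC_COARSE", "CLOCK_MONOTONIC")):
    """Hypothetical helper: cheapest clock whose resolution fits the interval."""
    best = None
    for name in candidates:
        clock_id = getattr(time, name, None)  # the coarse clock is Linux-only
        if clock_id is None:
            continue
        resolution = time.clock_getres(clock_id)
        n = 100_000
        start = time.perf_counter()
        for _ in range(n):
            time.clock_gettime(clock_id)
        overhead = (time.perf_counter() - start) / n
        if resolution <= interval and (best is None or overhead < best[1]):
            best = (name, overhead, resolution)
    return best


print(pick_clock(0.01))    # the ~3.3 ms coarse clock qualifies at a 10 ms interval
print(pick_clock(0.0001))  # too fine for the coarse clock, falls back to CLOCK_MONOTONIC
```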
@J08nY I've started some work to improve this here #273
It's still a rough draft, but on the off-chance that you have time, I'm currently hitting a problem on Linux that it would be great to get some debugging help on. On macOS I can test bits of the C API from ctypes using ctypes.pythonapi, but it doesn't work on Linux for some reason. If you get a chance to debug that, it would help speed things along :)
Perhaps ctypes.CDLL(None) would work, or maybe the dynamic linker hasn't actually loaded the library yet, since we never call into it before trying to load the symbol.
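Something like this is what I mean (a sketch; whether the symbols resolve this way depends on how the interpreter was linked):

```python
import ctypes

# dlopen(NULL): a handle to the symbols already loaded in the current process,
# which on Linux should include the CPython C API.
current_process = ctypes.CDLL(None)
current_process.Py_GetVersion.restype = ctypes.c_char_p
print(current_process.Py_GetVersion().decode())
```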
I will have a look at that PR.
It's been a while, but if you had a chance to look at #273, that would be great, @J08nY . The two approaches have been implemented. I'd also be curious to see the output of python metrics/count_samples.py on your machine.
Here is the count_samples output:
pyinstrument: the timer on your system has an overhead of 1201 nanoseconds,
which is considered high. You might experience longer runtimes than usual, and
programs with lots of pure-python code might be distorted.
You might want to try the timing thread option, which can be enabled using
--use-timing-thread at the command line, or by setting the use_timing_thread
parameter in the Profiler constructor.
Your system does offer a 'coarse' timer, with a lower overhead (0 nanoseconds).
You can enable it by setting pyinstrument's interval to a value higher than
0.00333 seconds. If you're happy with the lower precision, this is the best
option.
If you want to suppress this warning, you can set the environment variable
PYINSTRUMENT_IGNORE_OVERHEAD_WARNING to '1'.
Interval: 0.001, use_timing_thread: False
Expected samples: 100.0, actual samples: 99, 0.99x
Interval: 0.001, use_timing_thread: True
Expected samples: 100.0, actual samples: 91, 0.91x
Interval: 0.0001, use_timing_thread: False
Expected samples: 1000.0, actual samples: 992, 0.99x
Interval: 0.0001, use_timing_thread: True
Expected samples: 1000.0, actual samples: 601, 0.60x
Trying the examples/sympy_perf.py I get:
| Variant | Output |
|---|---|
| Python | This took 0:00:00.590824 |
| pyinstrument #273, no other CLI options | This took 0:00:11.999592 plus the new warning. |
| pyinstrument #273 with --use-timing-thread | This took 0:00:01.320666 |
| pyinstrument #273 with -i 0.00334 | This took 0:00:12.135095 |
So I guess the low-overhead timer is not getting used? Or still taking a lot of time?
Hmm, interesting! Thank you for the detailed output. Yes, I think you're right, it looks like the lower overhead timer isn't being used still. This is probably a simple coding oversight. I'll have a proper look tomorrow.
I am encouraged by the performance of the timing thread on your machine though! I get slightly worse results on my machine, in that the profiled thread doesn't see the time update as frequently; I get around 0.75x samples that way. I think the reasons for this are related to processor cache or other low-level stuff. I could fully mutex lock around that time variable, but I worry that would send us back into syscall land. With sample counts of 0.75x or higher, I think I'm happy with this approach.
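To illustrate the idea (a toy sketch, not pyinstrument's implementation): a daemon thread keeps a shared timestamp fresh, and the profile hook only reads that variable instead of calling the clock itself.

```python
import sys
import threading
import time


class TimingThread(threading.Thread):
    """Toy timing thread: keep `now` fresh so readers never touch the clock."""

    def __init__(self, resolution=0.0001):
        super().__init__(daemon=True)
        self.now = time.monotonic()
        self.resolution = resolution

    def run(self):
        while True:
            # Unsynchronized write; the profiled thread reads it racily, which
            # is why it can see the update a little late (the 0.75x effect).
            self.now = time.monotonic()
            time.sleep(self.resolution)


timer = TimingThread()
timer.start()
last_sample = timer.now


def profile_hook(frame, event, arg):
    global last_sample
    if timer.now - last_sample >= 0.001:  # the sampling interval
        last_sample = timer.now
        # a real profiler would capture the call stack here


sys.setprofile(profile_hook)
```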
That issue with the coarse timer should be resolved; would you mind checking it again?
Now (fbc1f18) I get:
| Variant | Output |
|---|---|
| pyinstrument #273, no other CLI options | This took 0:00:12.200014 plus the new warning. |
| pyinstrument #273 with --use-timing-thread | This took 0:00:01.238449 |
| pyinstrument #273 with -i 0.00334 | This took 0:00:01.306539 |
| pyinstrument #273 with -i 0.01 | This took 0:00:01.260370 |
So it looks like it works!
brilliant! I'll get this released then :)
Released! Big thank you for all your help with this @J08nY !