Significant performance overhead when timing requires a syscall on Linux
~Hi, I recently saw a performance regression while using pyinstrument. It seems to be connected to the 4.5.1 -> 4.5.3 update but also might be related to Python version (3.9.17 to 3.9.18).~ Update: This is likely related to sympy library use.
I have two GitHub CI runs which use pyinstrument to profile some code that uses sympy.
Good run
https://github.com/J08nY/pyecsca/actions/runs/6039852294/job/16389425413
In the good run, pyinstrument is 4.5.1 and Python is 3.9.17 and the logs look reasonable (look here for example, it takes about 1.2 seconds).
Bad run
https://github.com/J08nY/pyecsca/actions/runs/6247569280/job/16960430761
In the bad run, pyinstrument is 4.5.3 and Python is 3.9.18 and the logs look very long
(look here for example, it now takes 13.6 seconds). I understand the logs are quite verbose since I have the show_all option
enabled, but that does not explain the runtime increase.
Local test
Note that there are some additional calls to sympy added in commits between the two runs. However I believe these are not the cause of the performance regression, since I can observe the regression locally.
Running the code without pyinstrument, I get about a 5-second runtime:
start = datetime.now()
for _ in range(operations):
    one_point = dbl(p256.curve.prime, one_point, **p256.curve.parameters)[0]
end = datetime.now()
print(end - start)
Profiling 5000 256-bit doubling formula (dbl2015rcb) executions...
0:00:05.042373
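(For context, the instrumented number below is roughly what you get by wrapping the same loop in pyinstrument's Profiler API; this is only a sketch reusing the pyecsca objects from the snippet above, not the exact harness I used.)

```python
from datetime import datetime

from pyinstrument import Profiler

# Sketch only: dbl, p256, one_point and operations are the pyecsca objects
# from the snippet above; they are not defined here.
profiler = Profiler()
profiler.start()
start = datetime.now()
for _ in range(operations):
    one_point = dbl(p256.curve.prime, one_point, **p256.curve.parameters)[0]
end = datetime.now()
profiler.stop()
print(end - start)
print(profiler.output_text(unicode=True, color=True, show_all=True))
```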
Running with pyinstrument I get about 90 seconds:
Profiling 5000 256-bit doubling formula (dbl2015rcb) executions...
  _     ._   __/__   _ _  _  _ _/_   Recorded: 17:00:13  Samples:  90372
 /_//_/// /_\ / //_// / //_'/ //     Duration: 90.888    CPU time: 90.885
/   _/                      v4.5.3

Program: pyecsca/test/ec/perf_formula.py

90.888 main  test/ec/perf_formula.py:12
└─ 90.859 DoublingEFDFormula.__call__  pyecsca/ec/formula.py:223
   ├─ 85.403 DoublingEFDFormula.__validate_assumptions  pyecsca/ec/formula.py:154
   │  ├─ 42.555 sympify  sympy/core/sympify.py:101
   │  │     [30 frames hidden]  sympy, <string>, <built-in>, ast
   │  ├─ 20.666 Add.subs  sympy/core/basic.py:838
   │  │     [21 frames hidden]  sympy
   │  ├─ 5.970 Poly.__new__  sympy/polys/polytools.py:162
   │  │     [13 frames hidden]  sympy
   │  ├─ 5.280 FiniteField.__call__  sympy/polys/domains/domain.py:379
   │  │     [6 frames hidden]  sympy
   │  ├─ 5.184 ground_roots  sympy/polys/polytools.py:3718
   │  │     [11 frames hidden]  sympy
   │  ├─ 2.467 resolve  pyecsca/ec/formula.py:197
   │  │  └─ 1.241 resolve  pyecsca/ec/formula.py:197
   │  └─ 1.135 FiniteField.__init__  sympy/polys/domains/finitefield.py:114
   ├─ 3.505 CodeOp.__call__  pyecsca/ec/op.py:146
   │  ├─ 2.558 <module>  None
   │  └─ 1.786 GMPMod.method  pyecsca/ec/mod.py:115
   └─ 1.075 [self]  pyecsca/ec/formula.py
Note that it seems the pyinstrument version does not matter, as I get roughly 90 seconds even
if I use 4.3.0. I am using Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] locally.
I'm a bit confused: do you think the CPython upgrade caused the performance regression, or was it always there?
Another thing that has changed between these runs is the GitHub Actions runner image. But there could also be other factors, like the specific GHA runner that actually executed the job. If you can give me a local recreation of the performance issue, I can debug it, though. If it's 1.5s when running with python and 90s when running with pyinstrument, it would indeed seem to be a bug.
I did some more digging and the culprit seems to be the sympy calls. The situation looks something like this (pyinstrument 4.5.3, sympy 1.12, Python 3.11.5):
| | No sympy | sympy |
|---|---|---|
| No pyinstrument | 5s | 5s |
| pyinstrument | 5s | 90s |
The sympy calls are really not that performance intensive, in the non-instrumented run they take almost no time at all.
> If you can give me a local recreation of the performance issue, I can debug it though. If it's 1.5s when running with python and 90s when running with pyinstrument, it would indeed seem to be a bug.
I will try to create a simple reproducer. Should not be so hard.
All right, here is a reproducer (sympy 1.12, pyinstrument 4.5.3, Python 3.11.5).
With python sympy_instrument.py, it prints This took 0:00:00.636278.
With pyinstrument sympy_instrument.py, it prints This took 0:00:12.355938.
from datetime import datetime

from sympy import sympify, FF, symbols, Poly, Rational


def do_thing():
    # Some elliptic curve crypto stuff that is not important
    field = 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFF
    params = {
        "a": 0xFFFFFFFF00000001000000000000000000000000FFFFFFFFFFFFFFFFFFFFFFFC,
        "b": 0x5AC635D8AA3A93E7B3EBBD55769886BC651D06B0CC53B0F63BCE3C3E27D2604B,
    }
    k = FF(field)
    expr = sympify(f"3*b - b3", evaluate=False)
    for curve_param, value in params.items():
        expr = expr.subs(curve_param, k(value))
    param = str(expr.free_symbols.pop())

    def resolve(expression, k):
        if not expression.args:
            return expression
        args = []
        for arg in expression.args:
            if isinstance(arg, Rational):
                a = arg.p
                b = arg.q
                res = k(a) / k(b)
            else:
                res = resolve(arg, k)
            args.append(res)
        return expression.func(*args)

    expr = resolve(expr, k)
    poly = Poly(expr, symbols(param), domain=k)
    roots = poly.ground_roots()
    for root in roots:
        params[param] = int(root)
        break


if __name__ == "__main__":
    start = datetime.now()
    for _ in range(1000):
        do_thing()
    end = datetime.now()
    print("This took", end - start)
The reproducer above, when run with a varying number of iterations, leads to this runtime graph.
I also ran it with cProfile, which shows a similar slowdown. Then I tried an empty sys.setprofile function, which showed no statistically significant overhead (so it is not on the plot). Finally, I tried a sys.setprofile function that only calls time.time(), and that seems to match the cProfile/pyinstrument slowdown well (a minimal sketch of that experiment follows the summary below).
My conclusion from that is:
- Sympy triggers a ton of profiling events.
- To know when to sample, pyinstrument needs to know the time, so every profiling event needs to call some timer.
- With a large number of very fast function calls (so lots of profiling events), the overhead of getting the time (and then skipping the profiling event) adds up.
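For concreteness, the time.time()-only experiment was essentially the following (a minimal sketch, not the exact script behind the plot):

```python
import sys
import time
from datetime import datetime


def timing_only_hook(frame, event, arg):
    # The minimum a sampling profiler must do on every call/return event:
    # read the clock, then (here, always) decide not to take a sample.
    time.time()


def noop():
    pass


def workload():
    # Lots of very fast Python calls, so lots of profiling events.
    for _ in range(2_000_000):
        noop()


if __name__ == "__main__":
    for hook in (None, timing_only_hook):
        sys.setprofile(hook)
        start = datetime.now()
        workload()
        sys.setprofile(None)
        print("hook" if hook else "no hook", datetime.now() - start)
```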
This is highly platform-dependent, I think. Running 1000 iterations on my Mac, I get
$ python examples/sympy_perf.py
This took 0:00:00.403314
$ pyinstrument examples/sympy_perf.py
This took 0:00:00.845693
That's admittedly a 100% overhead, but nothing like the kind of deltas you're getting here.
What platform are you running on? We have had reports of stuff like this in Docker containers, for example.
Having this reproducer is really useful, thank you for that. I think there are some low-level things that can improve performance, once I know what platform you're on.
I'm assuming you're running this on Linux? I want to test a theory, could you run this code and tell me the output?
#include <stdio.h>
#include <elf.h>
#include <sys/auxv.h>

int main() {
    unsigned long vdso_start = getauxval(AT_SYSINFO_EHDR);
    if (vdso_start) {
        printf("vDSO is active at address: %lx\n", vdso_start);
    } else {
        printf("vDSO does not seem to be active.\n");
    }
    return 0;
}
Save that to a file vdso.c, compile it with clang vdso.c (you might have to apt install clang or similar), then run it with ./a.out.
The vDSO is a mechanism where the kernel makes the timing function gettimeofday really fast by injecting code into userspace to avoid a syscall. But I wonder if it doesn't happen in some cases, leading to these performance issues.
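If compiling C is inconvenient, roughly the same check can be done from Python through glibc's getauxval (a sketch, assuming Linux with glibc; AT_SYSINFO_EHDR is 33 in <elf.h>):

```python
import ctypes

AT_SYSINFO_EHDR = 33  # from <elf.h> on Linux

libc = ctypes.CDLL(None)
libc.getauxval.restype = ctypes.c_ulong
libc.getauxval.argtypes = [ctypes.c_ulong]

addr = libc.getauxval(AT_SYSINFO_EHDR)
if addr:
    print(f"vDSO is active at address: {addr:x}")
else:
    print("vDSO does not seem to be active.")
```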
> What platform are you running on? We have had reports of stuff like this in Docker containers, for example.
> Having this reproducer is really useful, thank you for that. I think there are some low-level things that can improve performance, once I know what platform you're on.
Interesting, I am running ArchLinux (6.5.5 kernel) on x86_64. But also saw this in GitHub CI. I am not sure what other details are important.
Btw. I think a simple reproducer is the following:
from datetime import datetime


def func():
    pass


def do_func_loop():
    for _ in range(6000):
        func()


if __name__ == "__main__":
    start = datetime.now()
    for _ in range(1000):
        do_func_loop()
    end = datetime.now()
    print((end - start).total_seconds())
I get:
$ python repro.py
0.185707
$ pyinstrument repro.py
15.766407
> Save that to a file vdso.c, compile it with clang vdso.c (you might have to apt install clang or similar), then run it with ./a.out.
Did this; it said vDSO is active at address: 7fffff7af000.
I could perhaps try measuring performance of the gettimeofday in C to confirm it is indeed the culprit.
> I could perhaps try measuring performance of the gettimeofday in C to confirm it is indeed the culprit.
Yeah, this would be the next step, I suppose. There are also other ways to get the time on POSIX, like clock_gettime(). I'll see about writing a script that compares them.
Here's a script:
#include <time.h>
#include <stdio.h>
#include <sys/time.h>

typedef enum {
    CT_BASELINE,
    CT_GETTIMEOFDAY,
    CT_MONOTONIC,
    CT_MONOTONIC_RAW,
    CT_CLOCK,
    CT_REALTIME,
    CT_MONOTONIC_COARSE,
} ClockType;

static volatile double dummy_clock = 0.0;

static double
floatclock(ClockType clock_type)
{
    switch (clock_type) {
        case CT_BASELINE:
            return dummy_clock;
        case CT_GETTIMEOFDAY: {
            struct timeval t;
            gettimeofday(&t, (struct timezone *)NULL);
            return (double)t.tv_sec + t.tv_usec*0.000001;
        }
        case CT_MONOTONIC: {
            struct timespec t;
            clock_gettime(CLOCK_MONOTONIC, &t);
            return (double)t.tv_sec + t.tv_nsec*0.000000001;
        }
        case CT_MONOTONIC_RAW: {
            struct timespec t;
            clock_gettime(CLOCK_MONOTONIC_RAW, &t);
            return (double)t.tv_sec + t.tv_nsec*0.000000001;
        }
        case CT_CLOCK:
            return (double)clock() / CLOCKS_PER_SEC;
        case CT_REALTIME: {
            struct timespec t;
            clock_gettime(CLOCK_REALTIME, &t);
            return (double)t.tv_sec + t.tv_nsec*0.000000001;
        }
        case CT_MONOTONIC_COARSE: {
#ifdef CLOCK_MONOTONIC_COARSE
            struct timespec t;
            clock_gettime(CLOCK_MONOTONIC_COARSE, &t);
            return (double)t.tv_sec + t.tv_nsec*0.000000001;
#endif
        }
    }
    return 0.0;
}

static double timing_overhead(ClockType clock_type) {
    int NUM_ITERATIONS = 1000000;
    double counter = 0.0;
    double start = floatclock(CT_MONOTONIC);
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        counter += floatclock(clock_type);
    }
    double end = floatclock(CT_MONOTONIC);
    if (counter == 0.0) {
        printf("!no-op! ");
    }
    return end - start;
}

int main()
{
    printf("Baseline: ");
    printf("%.3f\n", timing_overhead(CT_BASELINE));
    printf("gettimeofday: ");
    printf("%.3f\n", timing_overhead(CT_GETTIMEOFDAY));
    printf("clock_gettime(CLOCK_MONOTONIC): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC));
    printf("clock_gettime(CLOCK_MONOTONIC_RAW): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC_RAW));
    // clock isn't interesting imo
    // printf("clock(): ");
    // printf("%.3f\n", timing_overhead(CT_CLOCK));
    printf("clock_gettime(CLOCK_REALTIME): ");
    printf("%.3f\n", timing_overhead(CT_REALTIME));
    printf("clock_gettime(CLOCK_MONOTONIC_COARSE): ");
    printf("%.3f\n", timing_overhead(CT_MONOTONIC_COARSE));
    return 0;
}
This is what it prints on my system (macOS 13.4, arm64 M1 Max).
$ ./a.out
Baseline: !no-op! 0.007
gettimeofday: 0.030
clock_gettime(CLOCK_MONOTONIC): 0.031
clock_gettime(CLOCK_MONOTONIC_RAW): 0.023
clock(): 0.288
clock_gettime(CLOCK_REALTIME): 0.022
clock_gettime(CLOCK_MONOTONIC_COARSE): !no-op! 0.003
In an aarch64 ubuntu Docker container on the same machine:
Baseline: !no-op! 0.006
gettimeofday: 0.054
clock_gettime(CLOCK_MONOTONIC): 0.031
clock_gettime(CLOCK_MONOTONIC_RAW): 0.025
clock(): 0.423
clock_gettime(CLOCK_REALTIME): 0.025
clock_gettime(CLOCK_MONOTONIC_COARSE): 0.005
I've tried the above script on a few machines. My conclusion is that it's not as simple as the existence of the vDSO; it always seems to be there in my testing. But the vDSO will make a syscall if it doesn't support the current clock source, and if it makes a syscall, it's really slow.
You can inspect the clock source like this:
$ # get the clock source
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
arch_sys_counter
$ # get available clock sources on this machine
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
arch_sys_counter
$ # I believe you can set the current clock source using this:
$ # echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
In my testing, running Docker containers on arm64 uses the clock source arch_sys_counter, which is supported by the vDSO, but running Docker containers on x86_64 gives hyperv_clocksource_tsc_page, which isn't supported by the vDSO, making it slow. I suspect something similar is happening to you, @J08nY. The surefire way to test this theory is to run the program above with strace, e.g. strace -c ./a.out (maybe reducing the iterations a bit). You can see which clock sources hit syscalls.
> This is what it prints on my system (macOS 13.4, arm64 M1 Max).
I get the following:
Baseline: !no-op! 0.007
gettimeofday: 1.213
clock_gettime(CLOCK_MONOTONIC): 1.217
clock_gettime(CLOCK_MONOTONIC_RAW): 1.218
clock_gettime(CLOCK_REALTIME): 1.215
clock_gettime(CLOCK_MONOTONIC_COARSE): 0.007
> You can inspect the clock source like this:
I get:
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm
> The surefire way to test this theory is to run the program above with strace, e.g. strace -c ./a.out (maybe reducing the iterations a bit). You can see which clock sources hit syscalls.
Tested this with just 1000 iterations, I get:
Baseline: !no-op! 0.000
gettimeofday: 0.049
clock_gettime(CLOCK_MONOTONIC): 0.045
clock_gettime(CLOCK_MONOTONIC_RAW): 0.053
clock_gettime(CLOCK_REALTIME): 0.046
clock_gettime(CLOCK_MONOTONIC_COARSE): 0.000
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
72,20 0,017519 5 3012 clock_gettime
23,34 0,005664 5 1000 gettimeofday
2,20 0,000533 533 1 execve
0,78 0,000189 23 8 mmap
0,40 0,000096 16 6 write
0,21 0,000052 17 3 newfstatat
0,20 0,000049 24 2 pread64
0,16 0,000040 20 2 openat
0,09 0,000023 7 3 mprotect
0,07 0,000018 18 1 munmap
0,07 0,000017 5 3 brk
0,05 0,000013 13 1 1 access
0,05 0,000012 12 1 read
0,05 0,000012 6 2 close
0,05 0,000012 6 2 1 arch_prctl
0,02 0,000004 4 1 getrandom
0,01 0,000003 3 1 set_tid_address
0,01 0,000003 3 1 prlimit64
0,01 0,000003 3 1 rseq
0,01 0,000002 2 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100,00 0,024264 5 4052 2 total
So it seems all except the MONOTONIC_COARSE are going through a syscall.
I found two nice blog posts discussing the situation; apparently it has affected quite a few projects before. This post has some code for benchmarking the different clock sources. With it I get:
clocks.c
clock res (ns) secs nsecs
gettimeofday 1,000 1,696,777,115 253,319,000
CLOCK_REALTIME 1 1,696,777,115 253,328,015
CLOCK_REALTIME_COARSE 3,333,333 1,696,777,115 250,281,928
CLOCK_MONOTONIC 1 257,505 149,471,014
CLOCK_MONOTONIC_RAW 1 257,501 989,271,174
CLOCK_MONOTONIC_COARSE 3,333,333 257,505 146,418,780
ClockBench.cpp
Method samples min max avg median stdev
CLOCK_REALTIME 1023 907. 84578. 1297.50 42742.50 2635.12
CLOCK_REALTIME_COARSE 1023 0. 0. 0. 0. 0.
CLOCK_MONOTONIC 1023 838. 1467. 1093.37 1152.50 216.66
CLOCK_MONOTONIC_RAW 1023 907. 1467. 1204.10 1187.00 237.25
CLOCK_MONOTONIC_COARSE 1023 0. 0. 0. 0. 0.
cpuid+rdtsc 1023 51. 68. 57.88 59.50 8.34
rdtscp 1023 17. 34. 28.33 25.50 8.01
rdtsc 1023 1. 17. 15.24 9.00 4.85
Using CPU frequency = 1.000000
The COARSE clocks seem unreasonably fast, but if I understand their vDSO implementation correctly, reading them amounts to basically just a memory read, so these could be the real numbers. They also have low precision.
I tried changing the gettimeofday call into a clock_gettime call with CLOCK_MONOTONIC_COARSE in pyinstrument, and these are the times I got for the reproducer. I think it may be worthwhile to make pyinstrument more robust to slow clocks by having some mechanism that tests them and picks one that works (perhaps warning the user if the chosen clock cannot satisfy the requested profiling interval).
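To sketch what I mean in Python (this is not pyinstrument internals, and the real check would live in C): measure each candidate clock's per-call overhead and resolution, then pick the cheapest one whose resolution still fits the requested interval.

```python
import time


def pick_clock(interval, candidates=("CLOCK_MONOTONIC_COARSE", "CLOCK_MONOTONIC")):
    """Hypothetical helper: cheapest clock whose resolution fits the interval."""
    best = None
    for name in candidates:
        clock_id = getattr(time, name, None)  # the coarse clock is Linux-only
        if clock_id is None:
            continue
        resolution = time.clock_getres(clock_id)
        n = 100_000
        start = time.perf_counter()
        for _ in range(n):
            time.clock_gettime(clock_id)
        overhead = (time.perf_counter() - start) / n
        if resolution <= interval and (best is None or overhead < best[1]):
            best = (name, overhead, resolution)
    return best


print(pick_clock(0.01))    # the ~3.3 ms coarse clock qualifies at a 10 ms interval
print(pick_clock(0.0001))  # too fine for the coarse clock, falls back to CLOCK_MONOTONIC
```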
@J08nY I've started some work to improve this here #273
It's still a rough draft, but on the off-chance that you have time, I'm currently hitting a problem on Linux that it would be great to get some debugging help on. On macOS I can test bits of the C API from ctypes using ctypes.pythonapi, but it doesn't work on Linux for some reason. If you get a chance to debug that, it would help speed things along :)
Perhaps ctypes.CDLL(None) would work, or maybe the dynamic linker hasn't actually loaded the library yet, since we never call into it before trying to load the symbol.
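Something like this is what I mean (a sketch; whether the symbols resolve this way depends on how the interpreter was linked):

```python
import ctypes

# dlopen(NULL): a handle to the symbols already loaded in the current process,
# which on Linux should include the CPython C API.
current_process = ctypes.CDLL(None)
current_process.Py_GetVersion.restype = ctypes.c_char_p
print(current_process.Py_GetVersion().decode())
```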
I will have a look at that PR.
It's been a while, but if you had a chance to look at #273, that would be great, @J08nY . The two approaches have been implemented. I'd also be curious to see the output of python metrics/count_samples.py on your machine.
Here is the count_samples output:
pyinstrument: the timer on your system has an overhead of 1201 nanoseconds,
which is considered high. You might experience longer runtimes than usual, and
programs with lots of pure-python code might be distorted.
You might want to try the timing thread option, which can be enabled using
--use-timing-thread at the command line, or by setting the use_timing_thread
parameter in the Profiler constructor.
Your system does offer a 'coarse' timer, with a lower overhead (0 nanoseconds).
You can enable it by setting pyinstrument's interval to a value higher than
0.00333 seconds. If you're happy with the lower precision, this is the best
option.
If you want to suppress this warning, you can set the environment variable
PYINSTRUMENT_IGNORE_OVERHEAD_WARNING to '1'.
Interval: 0.001, use_timing_thread: False
Expected samples: 100.0, actual samples: 99, 0.99x
Interval: 0.001, use_timing_thread: True
Expected samples: 100.0, actual samples: 91, 0.91x
Interval: 0.0001, use_timing_thread: False
Expected samples: 1000.0, actual samples: 992, 0.99x
Interval: 0.0001, use_timing_thread: True
Expected samples: 1000.0, actual samples: 601, 0.60x
Trying the examples/sympy_perf.py I get:
| Variant | Output |
|---|---|
| Python | This took 0:00:00.590824 |
| pyinstrument #273, no other CLI options | This took 0:00:11.999592 plus the new warning. |
| pyinstrument #273 with --use-timing-thread | This took 0:00:01.320666 |
| pyinstrument #273 with -i 0.00334 | This took 0:00:12.135095 |
So I guess the low-overhead timer is not getting used? Or still taking a lot of time?
Hmm, interesting! Thank you for the detailed output. Yes, I think you're right, it looks like the lower overhead timer isn't being used still. This is probably a simple coding oversight. I'll have a proper look tomorrow.
I am encouraged by the performance of the timing thread on your machine though! I get slightly worse results on my machine, in that the profiled thread doesn't see the time update as frequently; I get around 0.75x samples that way. I think the reasons for this are related to processor cache or other low-level stuff. I could fully mutex lock around that time variable, but I worry that would send us back into syscall land. With sample counts of 0.75x or higher, I think I'm happy with this approach.
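To illustrate the idea (a toy sketch, not pyinstrument's implementation): a daemon thread keeps a shared timestamp fresh, and the profile hook only reads that variable instead of calling the clock itself.

```python
import sys
import threading
import time


class TimingThread(threading.Thread):
    """Toy timing thread: keep `now` fresh so readers never touch the clock."""

    def __init__(self, resolution=0.0001):
        super().__init__(daemon=True)
        self.now = time.monotonic()
        self.resolution = resolution

    def run(self):
        while True:
            # Unsynchronized write; the profiled thread reads it racily, which
            # is why it can see the update a little late (the 0.75x effect).
            self.now = time.monotonic()
            time.sleep(self.resolution)


timer = TimingThread()
timer.start()
last_sample = timer.now


def profile_hook(frame, event, arg):
    global last_sample
    if timer.now - last_sample >= 0.001:  # the sampling interval
        last_sample = timer.now
        # a real profiler would capture the call stack here


sys.setprofile(profile_hook)
```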
That issue with the coarse timer should be resolved; would you mind checking it again?
Now (fbc1f18) I get:
| Variant | Output |
|---|---|
| pyinstrument #273, no other CLI options | This took 0:00:12.200014 plus the new warning. |
| pyinstrument #273 with --use-timing-thread | This took 0:00:01.238449 |
| pyinstrument #273 with -i 0.00334 | This took 0:00:01.306539 |
| pyinstrument #273 with -i 0.01 | This took 0:00:01.260370 |
So it looks like it works!
brilliant! I'll get this released then :)
Released! Big thank you for all your help with this @J08nY !