pytest-benchmark
resource (CPU, RAM, I/O) usage tracking
the resource library allows Python processes to track memory usage and similar things.
forking may be necessary to properly monitor each test individually in this case. a separate data structure should also be built parallel to the results to track other resources.
there seems to be a way with getrusage(2) to get information about threads as well, but this doesn't seem to be a good idea considering Python's limited support for threads and how that extension is Linux-specific.
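for context, a minimal sketch of reading these counters from the stdlib resource module (Unix only; note ru_maxrss is kilobytes on Linux but bytes on macOS):

import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
print(usage.ru_utime)   # time in user mode (float, seconds)
print(usage.ru_stime)   # time in system mode (float, seconds)
print(usage.ru_maxrss)  # maximum resident set size
print(usage.ru_inblock, usage.ru_oublock)  # block input/output operations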
i think the following data points could be collected:
0 ru_utime time in user mode (float)
1 ru_stime time in system mode (float)
2 ru_maxrss maximum resident set size
9 ru_inblock block input operations
10 ru_oublock block output operations
those could be interesting but may be just adding too much noise:
3 ru_ixrss shared memory size
4 ru_idrss unshared memory size
5 ru_isrss unshared stack size
6 ru_minflt page faults not requiring I/O
7 ru_majflt page faults requiring I/O
8 ru_nswap number of swap outs
11 ru_msgsnd messages sent
12 ru_msgrcv messages received
13 ru_nsignals signals received
14 ru_nvcsw voluntary context switches
15 ru_nivcsw involuntary context switches
basically, this would extend the time metric to be an array of metrics with different units and so on...
would that be useful to others as well?
I like the idea of measuring different metrics.
Unfortunately measuring memory use is a bit tricky. Because Python keeps memory internally for performance reasons (eg: free lists) it doesn't always give back the memory to the operating system. There's http://pytracemalloc.readthedocs.org/ for more fine-grained measurements however.
About the other metrics (page faults, signals, context switches) I'm not really sure. Ideally you wouldn't have any changes in those I guess :)
0 ru_utime time in user mode (float)
1 ru_stime time in system mode (float)
For these you can make a custom timer right now (--benchmark-timer). Only one can be shown right now. Showing aggregate metrics is a big display change; if you could mock up how the results table would look, it would be really great.
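For example, a minimal sketch of such a custom timer measuring CPU (user + system) time instead of wall-clock time (the timers module name and cpu function are just illustrative):

# timers.py -- illustrative custom timer returning CPU time
import time

def cpu():
    # user + system CPU time of the process, in seconds; pytest-benchmark
    # subtracts two calls of this just like any other clock
    return time.process_time()

It would then be selected with pytest --benchmark-timer timers.cpu.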
i understand this is a big change... with only sys/user/wall clocks, it could look like this:
Wall clock time:
Name (time in us) Min Max Mean StdDev Rounds Iterations
---------------------------------------------------------------------------
test_ssh_cmd 24.3047 73.0390 24.8964 1.0002 6323 1
---------------------------------------------------------------------------
System time:
Name (time in us) Min Max Mean StdDev Rounds Iterations
---------------------------------------------------------------------------
test_ssh_cmd 24.3047 73.0390 24.8964 1.0002 6323 1
---------------------------------------------------------------------------
User time:
Name (time in us) Min Max Mean StdDev Rounds Iterations
---------------------------------------------------------------------------
test_ssh_cmd 24.3047 73.0390 24.8964 1.0002 6323 1
---------------------------------------------------------------------------
i.e. just split into multiple tables?
That could work. I plan to rework the display logic at some point.
Another required thing is support for aggregates in the timing handling. There's one tricky thing - calibration would only be possible for a single metric. Hmmm ....
On 2015-10-06 11:02:53, Ionel Cristian Mărieș wrote:
Another required thing is support for aggregates in the timing handling. There's one tricky thing - calibration would only be possible for a single metric. Hmmm ....
maybe we could assume the metrics are correlated for the purpose of calibration?
Yes, I don't think we have a choice. :)
Hey, I'm also interested in including CPU, RAM and I/O usage. Just curious what the state of this issue is - any progress since Oct 2015?
We could have this as an option (--benchmark-memprofile + --benchmark-cpuprofile or similar names) or always on (active if the tracemalloc module is available and whatever is needed for cpu profiling). The problem with the latter is that the behavior might be surprising (sudden and unnoticeable loss of functionality if some deps aren't there).
Ideally these stats would be included in the table somehow so you can compare/sort (unlike the --benchmark-cprofile feature, where the profile data is completely outside).
The problem with including them in the timings table is that the avg/mean/iqr/stddev columns would be completely unrelated to these cpu/mem stats.
Also, can we agree that these cpu/mem stats would be consistent and there's no need for calibration (iow: no rounds/iterations)? (iow: cpu/mem stats for a function would not vary from run to run)
There are some interesting ways to measure mem/cpu stats in https://github.com/haypo/perf - perhaps we can use it.
Just fyi, resource from the stdlib isn't available on Windows. A cross-platform way would be:
>>> import os
>>> import psutil
>>> from binary import convert_units
>>>
>>> process = psutil.Process(os.getpid())
>>> convert_units(process.memory_info().rss)
(15.609375, 'MiB')
In https://github.com/stellargraph/stellargraph, we're benchmarking/tracking the memory use of particular objects using a technique hacked on top of Python's built-in tracemalloc, creating a clock-like API for memory use: https://github.com/stellargraph/stellargraph/blob/15114f2258a93bc92f28882eea766639f3201ad1/tests/test_utils/alloc.py
Example test using it: https://github.com/stellargraph/stellargraph/blob/15114f2258a93bc92f28882eea766639f3201ad1/tests/core/test_stellargraph.py#L515-L532
@pytest.mark.benchmark(group="StellarGraph creation", timer=snapshot)
# various element counts, to give an indication of the relationship
# between those and memory use (0,0 gives the overhead of the
# StellarGraph object itself, without any data)
@pytest.mark.parametrize("num_nodes,num_edges", [(0, 0), (100, 200), (1000, 5000)])
# features or not, to capture their cost
@pytest.mark.parametrize("feature_size", [None, 100])
def test_allocation_benchmark_creation_from_networkx(
    allocation_benchmark, feature_size, num_nodes, num_edges
):
    g, node_features = example_benchmark_graph(
        feature_size, num_nodes, num_edges, features_in_nodes=True
    )

    def f():
        return StellarGraph(g, node_features=node_features)

    allocation_benchmark(f)
The key is the timer=snapshot and the allocation_benchmark fixture.
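For readers who don't want to follow the links, a rough sketch of the clock-like idea (not StellarGraph's actual code): the timer returns the bytes currently traced, so pytest-benchmark's end-minus-start subtraction yields the net allocation of the benchmarked call.

import tracemalloc

tracemalloc.start()

def snapshot():
    # behaves like a clock: each call returns the bytes currently traced, so
    # the difference between two calls is the net memory allocated in between
    current, _peak = tracemalloc.get_traced_memory()
    return current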
It would be a nice feature! For RAM tracking, https://github.com/python-recsys/benchy might be a source of inspiration? It uses the memory_profiler package.
Hello, has anything happened on adding the memory profile option?
Not yet. Anyone interested in helping with this?
note that https://github.com/P403n1x87/pytest-austin is a thing too...
I think I got this working with a custom timer function:
import tracemalloc

tracemalloc.start()

def malloc():
    size, peak = tracemalloc.get_traced_memory()
    tracemalloc.clear_traces()
    return peak * 4  # this factor seems to vary on operating systems
Then call pytest with --benchmark-timer timers.malloc, assuming that the above function is in a timers.py file.
For this to work, you have to use pedantic mode and define the rounds yourself. Also, there needs to be one warmup round, as otherwise the first round leads to funny results.
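A minimal sketch of what such a test could look like (the workload and names are illustrative):

# test_memory.py -- uses the malloc timer above in pedantic mode
import pytest
from timers import malloc  # assumes the timer above lives in timers.py

def build_big_list():
    return [0] * 1_000_000

@pytest.mark.benchmark(timer=malloc)
def test_list_allocation(benchmark):
    # explicit rounds plus one warmup round, as described above
    benchmark.pedantic(build_big_list, iterations=1, rounds=5, warmup_rounds=1)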
@AngryUbuntuNerd you may be interested in an expansion of https://github.com/ionelmc/pytest-benchmark/issues/28#issuecomment-561960809 above: since that comment was written, we've added benchmarking of peak memory too. New references:
- code: https://github.com/stellargraph/stellargraph/blob/1e6120fcdbbedd3eb58e8fecc0eabc6999101ee6/tests/test_utils/alloc.py (see the peak function and MallocPeak class for how peak memory recording works)
- example of a benchmark that records time, long-term memory use and peak memory use: https://github.com/stellargraph/stellargraph/blob/1e6120fcdbbedd3eb58e8fecc0eabc6999101ee6/tests/core/test_graph.py#L1054-L1103
Unfortunately, I think your approach doesn't quite do the right thing, because it does start = malloc(); <benchmark code>; end = malloc(); return end - start: it records the peak before the benchmark starts, resets it, and then subtracts that first peak from the unrelated peak after running the benchmark. The MallocPeak class above avoids this via a custom __sub__ method.
@huonw thanks for your solution.
I think the tracemalloc way works as well though: the second malloc() will record the memory spike since the last call of clear_traces(). Tested it with a few MB and GB of memory eating, and with different rounds/iterations, and the results match.
Yeah, you're right that tracemalloc works really well for this: both solutions are using it. The only difference between them is the return value of the "timer" and how it computes the "duration" via a subtraction between the start "time" and the end "time".
(I think your * 4 factor may be driven by this, because I don't think it should be necessary: tracemalloc is returning the exact byte size of the current/peak memory allocations it knows about, so multiplying by some arbitrary factor is... confusing.)
To show this in practice, I wrote a benchmark which uses the basic timer (without the * 4) and the one from StellarGraph above (that is, with a class with a custom __sub__): I've used a setup function with a very high peak (80MB, on a 64-bit computer) followed by a benchmark function with a lower peak (8MB), as an extreme example.
Expand for benchmark code
# issue_28.py
import pytest
import tracemalloc

tracemalloc.start()


class MallocPeak:
    def __init__(self, peak):
        self.peak = peak

    def __sub__(self, other):
        # ignore other
        return self.peak


def peak_default_sub():
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.clear_traces()
    return peak


def peak_fancy_sub():
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.clear_traces()
    return MallocPeak(peak)


def _setup():
    # setup code with a comically high peak, to make the issue clear
    x = [None] * 10_000_000  # 80MB
    del x


def _func():
    # real code has a lower peak than the setup
    y = [None] * 1_000_000  # 8MB
    del y


@pytest.mark.benchmark(timer=peak_default_sub)
def test_default_sub(benchmark):
    benchmark.pedantic(_func, iterations=1, rounds=1, warmup_rounds=0, setup=_setup)


@pytest.mark.benchmark(timer=peak_fancy_sub)
def test_fancy_sub(benchmark):
    benchmark.pedantic(_func, iterations=1, rounds=1, warmup_rounds=0, setup=_setup)
With that setup, the results of pytest issue_28.py are:
-------------------------------------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------------------------------------
Name (time in ns) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_default_sub -72,167,433,000,000,000.0000 (1.0) -72,167,433,000,000,000.0000 (1.0) -72,167,433,000,000,000.0000 (1.0) 0.0000 (1.0) -72,167,433,000,000,000.0000 (1.0) 0.0000 (1.0) 0;0 -0.0000 (0.11) 1 1
test_fancy_sub 8,000,512,000,000,000.0000 (0.11) 8,000,512,000,000,000.0000 (0.11) 8,000,512,000,000,000.0000 (0.11) 0.0000 (1.0) 8,000,512,000,000,000.0000 (0.11) 0.0000 (1.0) 0;0 0.0000 (1.0) 1 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(Note that it's reporting in "nanoseconds" because of the negative values: all the byte sizes are multiplied by 1_000_000_000.)
In summary:
- test_default_sub is very incorrect (a negative peak!): it gives a value close to -72MB = 8MB - 80MB, because the timer is subtracting the peak of the setup function off the peak of the real benchmark
- test_fancy_sub is correct: it gives almost exactly 8MB, which is the right answer, given the test function just allocates an 8MB list
This is exaggerated for clarity, but more realistic settings show the error too: for instance, change the setup function to have a peak of 8MB too (test_default_sub: -168KB; test_fancy_sub: 8MB, as above) or don't use a setup function at all (test_default_sub: 7.8MB; test_fancy_sub: 8MB, as above).
Why is this happening? If we unroll the loop for the test_default_sub benchmark and inline the functions, we effectively get:
# do setup
_setup()
# snapshot 'time'/size before
_, peak_before = tracemalloc.get_traced_memory() # peak_before = 80MB
tracemalloc.clear_traces() # reset the peak to 0, but peak_before = 80MB still
# run benchmarked function
_func()
# snapshot 'time'/size after
_, peak_after = tracemalloc.get_traced_memory() # peak_after = 8MB
tracemalloc.clear_traces()
# compute result 'duration'
result = peak_after - peak_before # 8MB - 80MB = -72MB
In particular, the 'before' value needs to be 0, because the clear_traces means that's where we're starting. AFAIK, the best way to achieve this is to use a custom subtraction on the result line, because it's hard to know inside a timer call if it's the 'before' snapshot (to return 0) or the 'after' one (to return the real peak).
I haven't used it, but I imagine pytest-memray might be a good alternative to profiling memory directly in this pytest extension. This appears to be a tricky task, so maybe leave it to the well-funded experts at Bloomberg to implement, and then we just use two tools, each doing what it's good at.
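A hedged sketch of what that could look like, based on pytest-memray's documented --memray flag and limit_memory marker (the test body and threshold are illustrative; double-check against the current pytest-memray docs):

# run with:  pytest --memray test_alloc.py
import pytest

@pytest.mark.limit_memory("24 MB")  # fail the test if it allocates more than this
def test_allocation_stays_bounded():
    data = [0] * 1_000_000  # roughly 8MB of pointers on a 64-bit build
    assert len(data) == 1_000_000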