resource (CPU, RAM, I/O) usage tracking

Open anarcat opened this issue 8 years ago • 18 comments

the resource library allows Python processes to track their own memory usage and other resource statistics.

forking may be necessary to properly monitor each test individually in this case. a separate data structure should also be built in parallel to the results to track the other resources.

there seems to be a way with getrusage(2) to get information about individual threads as well (RUSAGE_THREAD), but this doesn't seem to be a good idea considering Python's limited support for threads and the fact that the extension is Linux-specific.

i think the following data points could be collected:

0   ru_utime    time in user mode (float)
1   ru_stime    time in system mode (float)
2   ru_maxrss   maximum resident set size
9   ru_inblock  block input operations
10  ru_oublock  block output operations

those could be interesting but may just add too much noise:

3   ru_ixrss    shared memory size
4   ru_idrss    unshared memory size
5   ru_isrss    unshared stack size
6   ru_minflt   page faults not requiring I/O
7   ru_majflt   page faults requiring I/O
8   ru_nswap    number of swap outs
11  ru_msgsnd   messages sent
12  ru_msgrcv   messages received
13  ru_nsignals signals received
14  ru_nvcsw    voluntary context switches
15  ru_nivcsw   involuntary context switches
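
for reference, a minimal sketch of how such a snapshot could be collected with the stdlib resource module (Unix-only, and assuming the per-test forking mentioned above, since ru_maxrss is a process-wide high-water mark):

import resource

def rusage_snapshot():
    # getrusage(2) for the current process
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "utime": ru.ru_utime,      # seconds in user mode (float)
        "stime": ru.ru_stime,      # seconds in system mode (float)
        "maxrss": ru.ru_maxrss,    # peak RSS: kilobytes on Linux, bytes on macOS
        "inblock": ru.ru_inblock,  # block input operations
        "oublock": ru.ru_oublock,  # block output operations
    }

before = rusage_snapshot()
# ... run one round of the benchmarked function here ...
after = rusage_snapshot()
# the counters can be diffed per round; ru_maxrss is a high-water
# mark rather than a counter, so it only makes sense per forked child
deltas = {k: after[k] - before[k] for k in ("utime", "stime", "inblock", "oublock")}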

basically, this would extend the single time metric into an array of metrics, each with its own unit and so on...

would that be useful to others as well?

anarcat avatar Oct 06 '15 06:10 anarcat

I like the idea of measuring different metrics.

Unfortunately measuring memory use is a bit tricky. Because Python keeps memory around internally for performance reasons (eg: free lists), it doesn't always give memory back to the operating system. There's http://pytracemalloc.readthedocs.org/ for finer-grained measurements, however.
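
(pytracemalloc was later merged into the standard library as the tracemalloc module in Python 3.4; a minimal illustration of the kind of fine-grained measurement it provides, tracking Python-level allocations rather than OS-level RSS:)

import tracemalloc

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
data = [0] * 1_000_000  # roughly 8 MB of list storage on a 64-bit build
current, peak = tracemalloc.get_traced_memory()
print(current - before, peak)  # net bytes allocated, and the peak since start()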

About the other metrics (page faults, signals, context switches) I'm not really sure. Ideally you wouldn't have any changes in those I guess :)

0   ru_utime    time in user mode (float)
1   ru_stime    time in system mode (float)

For these you can already make a custom timer (--benchmark-timer), though only one metric can be shown at a time. Showing aggregate metrics is a big display change; if you could mock up how the results table would look, that would be really great.
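
(For illustration, a hypothetical timers.py that could be passed as pytest --benchmark-timer timers.user_time; a sketch only, relying on the fact that pytest-benchmark just needs a zero-argument callable returning a float and subtracts two readings per round:)

# timers.py -- hypothetical example module
import resource

def user_time():
    # cumulative user-mode CPU seconds for this process (Unix-only);
    # since pytest-benchmark subtracts consecutive readings, each round
    # is measured in user CPU time instead of wall-clock time
    return resource.getrusage(resource.RUSAGE_SELF).ru_utime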

ionelmc avatar Oct 06 '15 09:10 ionelmc

i understand this is a big change... with only sys/user/wall clocks, it could look like this:

Wall clock time:

Name (time in us)         Min      Max     Mean  StdDev  Rounds  Iterations
---------------------------------------------------------------------------
test_ssh_cmd          24.3047  73.0390  24.8964  1.0002    6323           1
---------------------------------------------------------------------------

System time:

Name (time in us)         Min      Max     Mean  StdDev  Rounds  Iterations
---------------------------------------------------------------------------
test_ssh_cmd          24.3047  73.0390  24.8964  1.0002    6323           1
---------------------------------------------------------------------------

User time:

Name (time in us)         Min      Max     Mean  StdDev  Rounds  Iterations
---------------------------------------------------------------------------
test_ssh_cmd          24.3047  73.0390  24.8964  1.0002    6323           1
---------------------------------------------------------------------------

i.e. just split into multiple tables?

anarcat avatar Oct 06 '15 13:10 anarcat

That could work. I plan to rework the display logic at some point.

Another required thing is support for aggregates in the timing handling. There's one tricky thing - calibration would only be possible for a single metric. Hmmm ....

ionelmc avatar Oct 06 '15 15:10 ionelmc

On 2015-10-06 11:02:53, Ionel Cristian Mărieș wrote:

Another required thing is support for aggregates in the timing handling. There's one tricky thing - calibration would only be possible for a single metric. Hmmm ....

maybe we could assume the metrics are correlated for the purpose of calibration?

anarcat avatar Oct 06 '15 15:10 anarcat

Yes, I don't think we have a choice. :)

ionelmc avatar Oct 06 '15 15:10 ionelmc

Hey, I'm also interested in including CPU, RAM, and I/O usage. Just curious what's the state of this issue - any progress since Oct 2015?

varac avatar Mar 27 '17 14:03 varac

We could have this as an option (--benchmark-memprofile + --benchmark-cpuprofile or similar names) or as always on (active if the tracemalloc module is available, plus whatever is needed for cpu profiling). The problem with the latter is that the behavior might be surprising (sudden and unnoticeable loss of functionality if some deps aren't there).

Ideally these stats would be included in the table somehow so you can compare/sort (unlike the --benchmark-cprofile feature, where the profile data lives completely outside the table).

The problem with including them in the timings table is that the avg/mean/iqr/stddev columns would be completely unrelated to these cpu/mem stats.

Also, can we agree that these cpu/mem stats would be consistent and there's no need for calibration (iow: no rounds/iterations)? (iow: cpu/mem stats for a function would not vary from run to run)

There are some interesting ways of measuring mem/cpu stats in https://github.com/haypo/perf - perhaps we can use it.

ionelmc avatar Apr 04 '17 19:04 ionelmc

Just FYI, resource from the stdlib isn't available on Windows. A cross-platform way would be:

>>> import os
>>> import psutil
>>> from binary import convert_units
>>>
>>> process = psutil.Process(os.getpid())
>>> convert_units(process.memory_info().rss)
(15.609375, 'MiB')

ofek avatar Apr 30 '18 17:04 ofek

In https://github.com/stellargraph/stellargraph, we're benchmarking/tracking the memory use of particular objects using a technique hacked on top of Python's built-in tracemalloc: creating a clock-like API for memory use: https://github.com/stellargraph/stellargraph/blob/15114f2258a93bc92f28882eea766639f3201ad1/tests/test_utils/alloc.py

Example test using it: https://github.com/stellargraph/stellargraph/blob/15114f2258a93bc92f28882eea766639f3201ad1/tests/core/test_stellargraph.py#L515-L532

import pytest

# snapshot and the allocation_benchmark fixture come from the alloc.py
# helper linked above; StellarGraph is the library class under test and
# example_benchmark_graph is a helper from the linked test file
@pytest.mark.benchmark(group="StellarGraph creation", timer=snapshot)
# various element counts, to give an indication of the relationship
# between those and memory use (0,0 gives the overhead of the
# StellarGraph object itself, without any data)
@pytest.mark.parametrize("num_nodes,num_edges", [(0, 0), (100, 200), (1000, 5000)])
# features or not, to capture their cost
@pytest.mark.parametrize("feature_size", [None, 100])
def test_allocation_benchmark_creation_from_networkx(
    allocation_benchmark, feature_size, num_nodes, num_edges
):
    g, node_features = example_benchmark_graph(
        feature_size, num_nodes, num_edges, features_in_nodes=True
    )

    def f():
        return StellarGraph(g, node_features=node_features)

    allocation_benchmark(f)

The key pieces are the timer=snapshot argument and the allocation_benchmark fixture.
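
The rough idea behind the "memory clock", as a simplified sketch (the real snapshot in the linked alloc.py has more machinery):

import tracemalloc

tracemalloc.start()

def snapshot():
    # behaves like a clock: pytest-benchmark subtracts two readings, so a
    # round's "duration" becomes the net bytes allocated during the round
    current, _peak = tracemalloc.get_traced_memory()
    return current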

huonw avatar Dec 05 '19 04:12 huonw

It would be a nice feature! For the RAM tracking, https://github.com/python-recsys/benchy might be a source of inspiration? It uses the memory_profiler package.

maingoh avatar Dec 17 '19 22:12 maingoh

hello, has anything happened on adding the memory profile option?

LucaPaterlini avatar May 31 '20 09:05 LucaPaterlini

Not yet. Anyone interested in helping with this?

ionelmc avatar Nov 02 '20 09:11 ionelmc

note that https://github.com/P403n1x87/pytest-austin is a thing too...

anarcat avatar Nov 08 '20 18:11 anarcat

I think I got this working with a custom timer function:

import tracemalloc

tracemalloc.start()


def malloc():
    size, peak = tracemalloc.get_traced_memory()
    tracemalloc.clear_traces()

    return peak*4  # this factor seems to vary on operating systems

Then call pytest with --benchmark-timer timers.malloc, assuming that the above function is in a timers.py file.

For this to work, you have to use pedantic mode and define the rounds yourself. Also, there needs to be one warmup round (warmup_rounds=1), as otherwise the first round leads to funny results.
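
For example, a hypothetical test following that recipe (my_function is a placeholder for the code under test; run with pytest --benchmark-timer timers.malloc):

def my_function():
    return [0] * 1_000_000  # stand-in workload allocating roughly 8 MB

def test_memory_usage(benchmark):
    # explicit rounds plus one warmup round, per the note above
    benchmark.pedantic(my_function, rounds=10, warmup_rounds=1)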

AngryUbuntuNerd avatar Feb 15 '21 13:02 AngryUbuntuNerd

@AngryUbuntuNerd you may be interested in an expansion of https://github.com/ionelmc/pytest-benchmark/issues/28#issuecomment-561960809 above: since that comment was written, we've added benchmarking of peak memory too. New references:

  • code: https://github.com/stellargraph/stellargraph/blob/1e6120fcdbbedd3eb58e8fecc0eabc6999101ee6/tests/test_utils/alloc.py (see the peak function and MallocPeak class for how peak memory recording works)
  • example of a benchmark that records time, long-term memory use and peak memory use: https://github.com/stellargraph/stellargraph/blob/1e6120fcdbbedd3eb58e8fecc0eabc6999101ee6/tests/core/test_graph.py#L1054-L1103

Unfortunately, I think your approach doesn't quite do the right thing, because it effectively does start = malloc(); <benchmark code>; end = malloc(); return end - start: it records the peak before the benchmark starts, resets it, and then subtracts that first peak from the unrelated peak measured after running the benchmark. The MallocPeak class above avoids this via a custom __sub__ method.

huonw avatar Feb 15 '21 21:02 huonw

@huonw thanks for your solution

I think the tracemalloc way works as well though: the second malloc() will record the memory spike since the last call to clear_traces(). I tested it with a few MB and GB of memory allocations, and different rounds/iterations, and the results match.

AngryUbuntuNerd avatar Feb 16 '21 14:02 AngryUbuntuNerd

Yeah, you're right that tracemalloc works really well for this: both solutions are using it. The only difference between them is the return value of the "timer" and how it computes the "duration" via a subtraction between the start "time" and the end "time".

(I think your * 4 factor may be driven by this because I don't think it should be necessary: tracemalloc is returning the exact byte size of the current/peak memory allocations it knows about, so multiplying by some arbitrary factor is... confusing.)

To show this in practice, I wrote a benchmark comparing the basic timer (without the * 4) and the one from StellarGraph above (that is, with a class with a custom __sub__). I've used a setup function with a very high peak (80MB, on a 64-bit computer) followed by a benchmark function with a lower peak (8MB), as an extreme example.

Benchmark code:
# issue_28.py
import pytest
import tracemalloc

tracemalloc.start()

class MallocPeak:
    def __init__(self, peak):
        self.peak = peak

    def __sub__(self, other):
        # ignore other
        return self.peak

def peak_default_sub():
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.clear_traces()
    return peak

def peak_fancy_sub():
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.clear_traces()
    return MallocPeak(peak)

def _setup():
    # setup code with a comically high peak, to make the issue clear
    x = [None] * 10_000_000 # 80MB
    del x

def _func():
    # real code has a lower peak than the setup
    y = [None] * 1_000_000 # 8MB
    del y

@pytest.mark.benchmark(timer=peak_default_sub)
def test_default_sub(benchmark):
    benchmark.pedantic(_func, iterations=1, rounds=1, warmup_rounds=0, setup=_setup)

@pytest.mark.benchmark(timer=peak_fancy_sub)
def test_fancy_sub(benchmark):
    benchmark.pedantic(_func, iterations=1, rounds=1, warmup_rounds=0, setup=_setup)

With that setup, the results of pytest issue_28.py are:

-------------------------------------------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------------------------------------------
Name (time in ns)                              Min                                     Max                                    Mean            StdDev                                  Median               IQR            Outliers     OPS            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_default_sub      -72,167,433,000,000,000.0000 (1.0)      -72,167,433,000,000,000.0000 (1.0)      -72,167,433,000,000,000.0000 (1.0)      0.0000 (1.0)      -72,167,433,000,000,000.0000 (1.0)      0.0000 (1.0)           0;0 -0.0000 (0.11)          1           1
test_fancy_sub          8,000,512,000,000,000.0000 (0.11)       8,000,512,000,000,000.0000 (0.11)       8,000,512,000,000,000.0000 (0.11)     0.0000 (1.0)        8,000,512,000,000,000.0000 (0.11)     0.0000 (1.0)           0;0  0.0000 (1.0)           1           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

(Note that it's reporting in "nanoseconds" because of the negative values: all the byte sizes are multiplied by 1_000_000_000.)

In summary:

  • test_default_sub is very incorrect (a negative peak!): it gives a value close to -72MB = 8MB - 80MB, because the timer subtracts the peak of the setup function from the peak of the real benchmark
  • test_fancy_sub is correct: it gives almost exactly 8MB, which is the right answer, given the test function just allocates an 8MB list

This is exaggerated for clarity, but more realistic settings show the error too: for instance, change the setup function to have a peak of 8MB too (test_default_sub: -168KB; test_fancy_sub: 8MB, as above) or don't use a setup function at all (test_default_sub: 7.8MB, test_fancy_sub: 8MB, as above).

Why is this happening? If we unroll the loop for the test_default_sub benchmark and inline the functions, we effectively get:

# do setup
_setup()

# snapshot 'time'/size before
_, peak_before = tracemalloc.get_traced_memory() # peak_before = 80MB
tracemalloc.clear_traces() # reset the peak to 0, but peak_before = 80MB still

# run benchmarked function
_func()

# snapshot 'time'/size after
_, peak_after = tracemalloc.get_traced_memory() # peak_after = 8MB
tracemalloc.clear_traces()

# compute result 'duration'
result = peak_after - peak_before # 8MB - 80MB = -72MB

In particular, the 'before' value needs to be 0, because the clear_traces() call means that's where we're starting. AFAIK, the best way to achieve this is to use a custom subtraction on the result line, because it's hard to know inside a timer call whether it's the 'before' snapshot (which should return 0) or the 'after' one (which should return the real peak).

huonw avatar Feb 16 '21 21:02 huonw

I haven't used it, but I imagine that pytest-memray might be a good alternative to profiling memory directly in this pytest extension. This appears to be a tricky task, so maybe leave it to the well-funded experts at Bloomberg to implement it for us, and then we just use two tools, each one doing what it's good at.

NickCrews avatar Feb 19 '24 22:02 NickCrews