Profiling MPI applications with Tracy can cause long spikes in non-blocking send/receive operations
We are using Tracy to profile our distributed GPU runtime system Celerity, and it's mostly working great. However, during some recent benchmarking runs on the Leonardo supercomputer we've noticed that traces often contain very long spikes for MPI non-blocking send / receive operations, with some transfers taking several thousand times longer than they should (e.g. 30ms instead of 10us, sometimes even > 100ms).
Here's an example trace for a run with 32 ranks. Notice how there are several small gaps throughout the run and a few very long ones towards the end, caused by long transfers (in the "p2p" fibers at the very bottom). The application is a simple stencil code executed over 10000 iterations, with each iteration performing exactly the same set of operations (point-to-point transfers between ranks, some copies, as well as GPU kernel executions).
Long story short, it turns out that those spikes only happen while profiling with Tracy, and therefore seem to be due to some unfortunate interaction between the Tracy client and MPI.
What is very curious is that the gaps happen at seemingly predictable phases of the program's execution.
Here is another trace of the same application / configuration. Notice how the pattern of gaps looks very similar, although in this case the long gap towards the end is quite a bit shorter.
I've managed to create a small-ish reproducer program, in case anyone is interested:
#include <numeric>
#include <optional>
#include <functional>
#include <algorithm>
#include <array>
#include <vector>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <deque>
#include <mpi.h>
#include <tracy/Tracy.hpp>
#include <tracy/TracyC.h>
using clk = std::chrono::steady_clock;
using namespace std::chrono_literals;
int main(int argc, char* argv[]) {
    const size_t transfer_bytes = argc > 1 ? std::atol(argv[1]) : 16384 * 4;
    const size_t iterations = 10000;
    const size_t warmup = 100;
    int provided = -1;
    MPI_Init_thread(nullptr, nullptr, MPI_THREAD_MULTIPLE, &provided);
    int rank = -1;
    int size = -1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if(rank == 0) {
        printf("Transferring %zu bytes %zu times, %zu warmup iterations\n", transfer_bytes, iterations, warmup);
        fflush(stdout);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    std::vector<unsigned char> sendbuf_up(transfer_bytes);
    std::vector<unsigned char> recvbuf_up(transfer_bytes);
    std::vector<unsigned char> sendbuf_down(transfer_bytes);
    std::vector<unsigned char> recvbuf_down(transfer_bytes);
    std::vector<clk::duration> times;
    times.reserve(iterations);
    const int up = rank > 0 ? rank - 1 : size - 1;
    const int down = rank < size - 1 ? rank + 1 : 0;
    const int tag_up = 0;
    const int tag_down = 1;
    for(size_t i = 0; i < iterations + warmup; ++i) {
        const auto before = clk::now();
        MPI_Request reqs[4];
        MPI_Irecv(recvbuf_up.data(), transfer_bytes, MPI_BYTE, up, tag_down + i, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recvbuf_down.data(), transfer_bytes, MPI_BYTE, down, tag_up + i, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(sendbuf_up.data(), transfer_bytes, MPI_BYTE, up, tag_up + i, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(sendbuf_down.data(), transfer_bytes, MPI_BYTE, down, tag_down + i, MPI_COMM_WORLD, &reqs[3]);
        bool done[4] = {false, false, false, false};
        bool all_done = false;
        while(!all_done) {
            ZoneScopedN("make things slow"); // <--- comment out this zone to remove spikes
            all_done = true;
            for(size_t j = 0; j < 4; ++j) {
                if(done[j]) continue;
                int flag = -1;
                MPI_Test(&reqs[j], &flag, MPI_STATUS_IGNORE);
                done[j] = flag != 0;
                all_done = all_done && done[j];
            }
        }
        const auto after = clk::now();
        if(i >= warmup) {
            times.push_back(after - before);
        }
    }
    MPI_Finalize();
    const auto sum = std::accumulate(times.begin(), times.end(), clk::duration{});
    const auto min = *std::min_element(times.begin(), times.end());
    const auto max = *std::max_element(times.begin(), times.end());
    printf("Rank %2d mean: %4zuus, min: %4zuus, max: %4zuus\n", rank, static_cast<size_t>(sum / 1us / iterations), static_cast<size_t>(min / 1us), static_cast<size_t>(max / 1us));
    return 0;
}
Obviously creating zones in a busy loop is not ideal, but this was the only way I could reproduce the effect in this small example. In our real application zones are submitted by different threads, including the thread that calls MPI_Test, but not for each iteration as is done here.
Here's the output when running on 32 ranks on Leonardo, with the ZoneScopedN enabled:
Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank 7 mean: 20us, min: 13us, max: 26901us
Rank 4 mean: 20us, min: 13us, max: 31456us
Rank 17 mean: 20us, min: 12us, max: 25444us
Rank 14 mean: 20us, min: 12us, max: 21204us
Rank 19 mean: 20us, min: 13us, max: 25211us
Rank 31 mean: 20us, min: 13us, max: 31455us
Rank 10 mean: 20us, min: 13us, max: 20106us
Rank 5 mean: 20us, min: 13us, max: 31455us
Rank 1 mean: 20us, min: 12us, max: 31458us
Rank 11 mean: 20us, min: 13us, max: 18880us
Rank 18 mean: 20us, min: 13us, max: 25231us
Rank 28 mean: 20us, min: 13us, max: 31472us
Rank 3 mean: 20us, min: 13us, max: 31455us
Rank 2 mean: 20us, min: 13us, max: 31458us
Rank 23 mean: 20us, min: 12us, max: 25189us
Rank 27 mean: 20us, min: 13us, max: 25188us
Rank 0 mean: 20us, min: 13us, max: 31460us
Rank 6 mean: 20us, min: 12us, max: 26901us
Rank 16 mean: 20us, min: 12us, max: 25554us
Rank 24 mean: 20us, min: 12us, max: 25186us
Rank 8 mean: 20us, min: 12us, max: 24177us
Rank 9 mean: 20us, min: 12us, max: 19973us
Rank 13 mean: 20us, min: 12us, max: 21199us
Rank 20 mean: 20us, min: 12us, max: 25191us
Rank 29 mean: 20us, min: 12us, max: 31467us
Rank 30 mean: 20us, min: 12us, max: 31461us
Rank 25 mean: 20us, min: 12us, max: 25186us
Rank 15 mean: 20us, min: 12us, max: 25550us
Rank 21 mean: 20us, min: 12us, max: 25186us
Rank 26 mean: 20us, min: 12us, max: 25184us
Rank 12 mean: 20us, min: 12us, max: 16532us
Rank 22 mean: 20us, min: 12us, max: 25188us
And here's the output without the ZoneScopedN:
Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank 8 mean: 15us, min: 12us, max: 250us
Rank 22 mean: 15us, min: 13us, max: 216us
Rank 20 mean: 15us, min: 12us, max: 216us
Rank 24 mean: 15us, min: 12us, max: 216us
Rank 15 mean: 15us, min: 12us, max: 249us
Rank 23 mean: 15us, min: 12us, max: 216us
Rank 0 mean: 15us, min: 13us, max: 251us
Rank 12 mean: 15us, min: 13us, max: 249us
Rank 25 mean: 15us, min: 12us, max: 217us
Rank 21 mean: 15us, min: 12us, max: 216us
Rank 19 mean: 15us, min: 12us, max: 215us
Rank 5 mean: 15us, min: 13us, max: 250us
Rank 14 mean: 15us, min: 12us, max: 249us
Rank 7 mean: 15us, min: 12us, max: 251us
Rank 29 mean: 15us, min: 12us, max: 217us
Rank 26 mean: 15us, min: 12us, max: 217us
Rank 17 mean: 15us, min: 12us, max: 216us
Rank 9 mean: 15us, min: 12us, max: 250us
Rank 18 mean: 15us, min: 12us, max: 217us
Rank 2 mean: 15us, min: 12us, max: 251us
Rank 11 mean: 15us, min: 12us, max: 249us
Rank 13 mean: 15us, min: 12us, max: 249us
Rank 4 mean: 15us, min: 13us, max: 251us
Rank 6 mean: 15us, min: 13us, max: 251us
Rank 28 mean: 15us, min: 12us, max: 216us
Rank 31 mean: 15us, min: 13us, max: 216us
Rank 3 mean: 15us, min: 13us, max: 251us
Rank 1 mean: 15us, min: 12us, max: 251us
Rank 27 mean: 15us, min: 12us, max: 216us
Rank 16 mean: 15us, min: 13us, max: 250us
Rank 10 mean: 15us, min: 12us, max: 249us
Rank 30 mean: 15us, min: 12us, max: 216us
I realize that this is a rather difficult issue to reproduce; I'm mainly opening it to see if anybody has any ideas as to what might be causing these spikes, or any suggestions for how to investigate this further.
One hypothesis we had was that somewhere inside MPI an OS / hardware interaction sometimes causes a thread to be scheduled out, and a Tracy thread would get scheduled in (which could result in delays on the order of milliseconds). However, it is unlikely that this would result in a consistent gap pattern. Furthermore, we've tried explicitly setting the thread affinity for Tracy and all other application threads to ensure no overlap, but this does not seem to change anything (or at least not consistently; we've seen a couple of instances where it seemed to eliminate the gaps, but then it wasn't reproducible).
Here are some additional things we've determined:
- It only seems to happen for transfers over the actual network (not for shared memory on a single node).
- It is reproducible for both OpenMPI and Intel MPI (MPICH).
- The spikes actually happen somewhere inside calls to MPI_Test et al.; pre-loading a dummy MPI library that replaces MPI_Isend / MPI_Irecv / MPI_Test with no-ops eliminates the gaps.
- It does not matter whether the trace is actually being consumed (e.g. via tracy-capture) or not.
You are using async / fiber functionality, and the current implementation switches everything to be fully serialized in such case. Maybe this is the reason why you see this behavior?
You are using async / fiber functionality, and the current implementation switches everything to be fully serialized in such case. Maybe this is the reason why you see this behavior?
Yes, Celerity uses the fibers API to render concurrent tasks in our runtime. However, the reproducer code does not, it only uses a single ZoneScopedN!
@psalz, depending on luck, I can sometimes reproduce your numbers and sometimes not, both for builds with Tracy and without.
I used Komondor HPC cluster with the following submit script:
#!/bin/bash
#SBATCH -A <ACCOUNT>
#SBATCH --partition=cpu
#SBATCH --job-name=TracyMPI
#SBATCH --time=0-00:10:00
#SBATCH --mem=200gb
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --output=slurm/slurm-%J.out
#SBATCH --error=slurm/slurm-%J.err
# LOAD GCC 14.1
# LOAD IntelMPI 2024.1
export PATH=.../tracy.git/capture/build:$PATH
mpirun ./TracyMPI &
tracy-capture -o tracy.${SLURM_JOBID}.${SLURM_LOCALID}.out
Here is the output with Tracy:
Connecting to 127.0.0.1:8086...Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank 8 mean: 77us, min: 20us, max: 394us
Rank 0 mean: 77us, min: 20us, max: 519us
Rank 9 mean: 77us, min: 11us, max: 402us
Rank 1 mean: 77us, min: 14us, max: 500us
Rank 14 mean: 77us, min: 13us, max: 487us
Rank 2 mean: 77us, min: 15us, max: 498us
Rank 15 mean: 77us, min: 19us, max: 490us
Rank 11 mean: 77us, min: 11us, max: 408us
Rank 12 mean: 77us, min: 11us, max: 458us
Rank 3 mean: 77us, min: 15us, max: 493us
Rank 13 mean: 77us, min: 11us, max: 463us
Rank 4 mean: 77us, min: 15us, max: 423us
Rank 10 mean: 77us, min: 12us, max: 409us
Rank 5 mean: 77us, min: 17us, max: 424us
Rank 6 mean: 77us, min: 20us, max: 399us
Rank 7 mean: 77us, min: 21us, max: 394us
Queue delay: 12 ns
Timer resolution: 9 ns
Frames: 2
Time span: 1.66 s
Zones: 814,577
Elapsed time: 2.4 s
Saving trace... done!
Trace size 6281.06 KB (14.24% ratio)
Lucky run without Tracy:
Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank 3 mean: 65us, min: 38us, max: 133us
Rank 0 mean: 65us, min: 23us, max: 199us
Rank 1 mean: 65us, min: 23us, max: 133us
Rank 2 mean: 65us, min: 38us, max: 133us
Rank 4 mean: 65us, min: 38us, max: 133us
Rank 5 mean: 65us, min: 41us, max: 134us
Rank 6 mean: 65us, min: 38us, max: 102us
Rank 7 mean: 65us, min: 29us, max: 148us
Rank 8 mean: 65us, min: 25us, max: 192us
Rank 9 mean: 65us, min: 20us, max: 170us
Rank 11 mean: 65us, min: 42us, max: 127us
Rank 15 mean: 65us, min: 25us, max: 152us
Rank 12 mean: 65us, min: 41us, max: 126us
Rank 14 mean: 65us, min: 41us, max: 126us
Rank 13 mean: 65us, min: 41us, max: 126us
Rank 10 mean: 65us, min: 23us, max: 127us
Unlucky run without Tracy:
Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank 0 mean: 1442us, min: 19us, max: 33296us
Rank 1 mean: 1442us, min: 12us, max: 20987us
Rank 2 mean: 1442us, min: 14us, max: 20816us
Rank 3 mean: 1442us, min: 13us, max: 20654us
Rank 4 mean: 1442us, min: 14us, max: 20488us
Rank 5 mean: 1441us, min: 13us, max: 20338us
Rank 6 mean: 1440us, min: 14us, max: 20209us
Rank 7 mean: 1440us, min: 12us, max: 20091us
Rank 8 mean: 1440us, min: 12us, max: 19976us
Rank 9 mean: 1440us, min: 18us, max: 33705us
Rank 11 mean: 1440us, min: 21us, max: 19896us
Rank 12 mean: 1440us, min: 25us, max: 19965us
Rank 13 mean: 1440us, min: 23us, max: 20088us
Rank 14 mean: 1440us, min: 19us, max: 20204us
Rank 15 mean: 1442us, min: 22us, max: 20338us
Rank 16 mean: 1442us, min: 18us, max: 20531us
Rank 17 mean: 1442us, min: 19us, max: 20680us
Rank 18 mean: 1442us, min: 12us, max: 20860us
Rank 19 mean: 1442us, min: 31us, max: 33240us
Rank 10 mean: 1440us, min: 24us, max: 33659us
Note that when I get huge maximum numbers, the mean is also huge, whereas in your numbers it is not.
So,
- Try to run your tasks with --cpus-per-task=2; that would increase the space for your threads on Leonardo (each node has 32 cores; I do not know the HT status), since Tracy has extra threads (as do the tracy-capture utility threads, if you use it). It allows you to avoid overcommitting CPU cores (but I think you did that already).
- Print out the whole times data. It looks like you have only a few spikes, which could be related to threads moving between cores.
- Check the topology of the task (how far apart are your nodes? Are they in a single rack?).
- Try to run sequentially with Tracy and without in a single job, even multiple times, to be sure that topology and MPI communication from other users are not the reasons.
P.S. Cray MPI works a bit differently and even reproduces your numbers sometimes (even on a single node), but anyway it looks like you caught some warm-up iterations.
Thanks for trying to reproduce the issue!
I was already pretty confident that there was a causal relationship between enabling Tracy and the spikes I am seeing, but I went ahead and confirmed this again in a larger experiment:
These plots are based on the individual times for each of the 10'000 iterations on each rank, aggregated on the master node and then written out to CSV. Using 32 ranks, on 32 nodes, each rank having all 32 CPUs and 300GB memory exclusively allocated. I did 10 runs for both configurations, alternating between with and without Tracy. All runs were done in a single sbatch script, so the set of nodes does not change, and we can rule out any topology-related effects.
Hmm... @psalz, What is your setup of Tracy?
I used that one (note, it uses code for #967):
if(ENABLE_TRACY)
  set(TRACY_LTO ON CACHE INTERNAL "Enable Link-Time optimization")
  include(FetchContent)
  FetchContent_Declare(tracy
    GIT_REPOSITORY https://github.com/foxtran/tracy.git
    GIT_TAG feature/lto
    GIT_SHALLOW TRUE
    GIT_PROGRESS TRUE
  )
  FetchContent_MakeAvailable(tracy)
endif()
for your example code, with an extra #include <random> to fix compilation with GCC 14.1.
After some more discussion with a colleague, we were wondering if the behavior could be explained by memory allocations happening inside Tracy. I just added time measurements around only the ZoneScopedN itself, and lo and behold, it sometimes spikes to tens of milliseconds -- it could be that I was simply measuring this all the time. As I said, in our distributed runtime we don't create a separate zone for each call to MPI_Test, and I am not 100% confident that the reproducer actually models the same issue that we are seeing there.
You are using async / fiber functionality, and the current implementation switches everything to be fully serialized in such case. Maybe this is the reason why you see this behavior?
Could it be that interactions with the Tracy API in one thread cause a memory (re-)allocation that then blocks interactions in other threads? If so, is there maybe a way I could tell Tracy to allocate a large chunk of memory in advance, instead of on-demand?
As far as I can see, Tracy uses RPMALLOC for non-emscripten builds, which should cache memory allocations: https://github.com/wolfpld/tracy/blob/da60684b9f61b34afa5aa243a7838d6e79096783/public/common/TracyAlloc.hpp#L6-L11
Welp, I'm an idiot. When I originally tried to reproduce this issue I ported the fiber mechanism from Celerity, and then upon removing it forgot to also remove the set(TRACY_FIBERS ON) from my CMakeLists.txt. Thanks @foxtran for asking what my setup was, because otherwise I wouldn't have gone back and looked.
So your initial hunch was right @wolfpld!
It's also clear now that the spikes don't have anything to do with MPI per se, but are only being "amplified" by it, because a spike on one rank causes the receive on another rank to take longer.
I ran a couple more experiments, this time also measuring the duration of ZoneScopedN. The second run disables rpmalloc to see whether this has any effect.
So clearly (a) the spikes are because of TRACY_FIBERS=ON, (b) it has something to do with memory allocations, (c) rpmalloc is already doing a good job mitigating a lot of spikes.
The reason we are seeing more spikes in the "per-iteration" times compared to the "ZoneScopedN" times most likely is because the former also measure spikes that happen on neighboring MPI ranks.
So the remaining question is what is happening in Tracy, and if there is anything we can do to mitigate it..?
@psalz, you need to cache the system call that gets the thread ID. Unfortunately, it currently happens on each message commit.
Starting from here: https://github.com/wolfpld/tracy/blob/da60684b9f61b34afa5aa243a7838d6e79096783/public/client/TracyProfiler.hpp#L137-L143
Then you have two implementations of GetThreadHandle (TracyProfiler.cpp is too big to clearly understand which #ifdef I am in):
https://github.com/wolfpld/tracy/blob/da60684b9f61b34afa5aa243a7838d6e79096783/public/client/TracyProfiler.cpp#L1385
https://github.com/wolfpld/tracy/blob/da60684b9f61b34afa5aa243a7838d6e79096783/public/client/TracyProfiler.cpp#L1325
In one case Tracy has caching; in the other, Tracy makes an extra call: https://github.com/wolfpld/tracy/blob/da60684b9f61b34afa5aa243a7838d6e79096783/public/common/TracySystem.cpp#L61-L99 where the system calls happen. That could explain why it has such spikes.
Hope, it helps :-)
Then you have two implementations of GetThreadHandle (TracyProfiler.cpp is too big to clearly understand which #ifdef I am in):
The one with caching is happening in #ifdef TRACY_DELAYED_INIT, the one without caching is in the #else of that.
You could try to hack in a static thread_local and see if it helps for your use case:
TRACY_API uint32_t GetThreadHandle() {
    static thread_local uint32_t tid = detail::GetThreadHandleImpl();
    return tid;
}