
Question about time spent between zones

Open · caioraphael1 opened this issue 3 weeks ago · 6 comments

Hello, I'm a bit confused about Tracy's performance. For the demo below, I'm getting the timings shown in the image:

{
    tracy.ZoneN("1")
}

{
    tracy.ZoneN("2")
}

{
    tracy.ZoneN("3")
}
[screenshots of the Tracy capture for the three zones]

This snippet uses the Tracy bindings for the Odin language; when a scope is exited, the zone end is called automatically. This is the implementation of ZoneBegin in the binding:

ZoneBegin :: proc(active: bool, depth: i32, loc := #caller_location) -> (ctx: ZoneCtx) {
	when TRACY_ENABLE {
		id := ___tracy_alloc_srcloc(u32(loc.line), _sl(loc.file_path), _sl(loc.procedure))
		when TRACY_HAS_CALLSTACK {
			ctx = ___tracy_emit_zone_begin_alloc_callstack(id, depth, b32(active))
		} else {
			ctx = ___tracy_emit_zone_begin_alloc(id, b32(active))
		}
	}
	return
}

I'm using Tracy 0.10.0 in this example, as the binding has not yet been updated for later versions. I'm on Windows 10 with a GTX 1660, 32 GB of RAM, and a Ryzen 3 3300X.

Both my code and Tracy are compiled in release. I'm confused about how I can get such a gap of ~470us between calls when the Tracy call itself takes 30-40ns. Section 1.7 of the Tracy docs states that "Tracy achieves such small overhead (only 2.25 ns)". I'd like to understand this performance hit a bit better.

Thank you very much.

caioraphael1 · Dec 01 '25

Are you running your program in Debug/unoptimized mode? That's the first thing I'd check. If the unexpectedly large gaps are still there in Release/optimized builds, I'd check the assembly that Odin is producing.

Just made a quick test here in C++:

#define TRACY_ENABLE

#include <chrono>  // std::chrono::milliseconds
#include <thread>  // std::this_thread::sleep_for

#include <path-to-tracy/public/tracy/Tracy.hpp>
#include <path-to-tracy/public/TracyClient.cpp>

int main(int argc, char* argv[]) {

    ZoneScoped;

    std::this_thread::sleep_for(std::chrono::milliseconds(10));

    {
        ZoneScopedN("1");
    }

    {
        ZoneScopedN("2");
    }

    {
        ZoneScopedN("3");
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(500));

    return 0;
}
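
(For completeness: a release/optimized single-file build of the above could look roughly like this on Linux; the include path and file names are placeholders, and the Tracy client there also needs pthread and dl.)

g++ -O2 -std=c++17 -I. tracy_test.cpp -pthread -ldl -o tracy_test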

and got this: [screenshot of the resulting Tracy capture]

Not sure why Zone "1" took longer than the others (but it is happening consistently; @wolfpld ideas?). Other than that, zones 2 and 3 are in the same ballpark as the gaps between them (~20ns).

As for the "Tracy achieves such small overhead (only 2.25 ns)" claim, well, there's a lot of nuance here (with rdtsc and such), but put simply, on a modern x64 machine, an "empty" zone should take between 10ns and 50ns.
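
To sanity-check that range, a minimal sketch along these lines (illustrative, not from this thread) times a large batch of empty zones with std::chrono and reports the average cost per zone; it assumes the same Tracy setup as the test above:

#define TRACY_ENABLE

#include <chrono>
#include <cstdio>

#include <path-to-tracy/public/tracy/Tracy.hpp>
#include <path-to-tracy/public/TracyClient.cpp>

static void empty_zone() {
    ZoneScopedN("empty"); // zone begin/end only, no work inside
}

int main() {
    constexpr int kIterations = 1000000;

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIterations; ++i) {
        empty_zone();
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double total_ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("average per-zone cost: %.1f ns\n", total_ns / kIterations);
    return 0;
}

Results in the 10-50ns range per zone in an optimized build would match that estimate; dramatically higher numbers usually point at debug builds or extra per-zone work such as callstack capture or runtime source-location allocation.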

slomp · Dec 03 '25

Hello! My code is in release mode, and so is Tracy. I was skeptical whether Odin was doing something weird internally, but people from the Odin Discord suggested that maybe this is related to the allocation made by ___tracy_alloc_srcloc, as mentioned in the Tracy docs for the C API. I also noticed that the first zone was consistently taking longer; you can see how zone 1 takes longer than zones 2 and 3 in the first image in my post. Thank you for sharing the ~20ns figure for the gaps, though honestly I thought the gap would be much smaller than that. The smallest gap I got on my machine was around ~200ns, but for this simple demo I'm getting around 400-450ns. If possible, could you check whether your results change if you use the C binding and allocate with ___tracy_alloc_srcloc? This probably means using ___tracy_emit_zone_begin_alloc instead of ___tracy_emit_zone_begin. Thank you :)

caioraphael1 · Dec 03 '25

people from the Odin Discord suggested that maybe this is related to the allocation made by ___tracy_alloc_srcloc

Makes sense. Source location information in most Tracy zone annotations is handled at compile time and stored in static, read-only data blocks in the program binary, so no dynamic allocation needs to happen.
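
For reference, the static path looks roughly like this when driving the C API by hand; this is a sketch of what the TracyCZoneN macro in TracyC.h expands to (assuming TRACY_ENABLE is defined), not code from this thread:

#include <path-to-tracy/public/tracy/TracyC.h>

void work(void) {
    // The source location lives in static, read-only data; nothing is
    // allocated per call, unlike the ___tracy_alloc_srcloc() route.
    static const struct ___tracy_source_location_data srcloc = {
        "1",                // zone name
        __func__,           // function
        __FILE__,           // file
        (uint32_t)__LINE__, // line
        0                   // color (0 = default)
    };
    TracyCZoneCtx ctx = ___tracy_emit_zone_begin(&srcloc, 1 /* active */);
    // ... the actual work ...
    ___tracy_emit_zone_end(ctx);
}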

slomp · Dec 03 '25

it is happening consistently; @wolfpld ideas?

Queue block allocation cost is amortized: the first zone pays for allocating a fresh queue block, and the zones that follow reuse it.

wolfpld · Dec 03 '25

@caioraphael1 I just tried emulating your Odin wrapper with this:

#include <cstdint>   // uint64_t
#include <cstring>   // strlen
#include <path-to-tracy/public/tracy/TracyC.h>  // TracyCZoneCtx and the ___tracy_* C API

struct CZoneScope {
    TracyCZoneCtx ctx;
    CZoneScope(const char* name, int line, const char* filename, const char* function, int depth) {
        uint64_t srcloc = ___tracy_alloc_srcloc_name(line, filename, strlen(filename), function, strlen(function), name, strlen(name), 0);
        ctx = ___tracy_emit_zone_begin_alloc_callstack(srcloc, depth, 1);
    }
    ~CZoneScope() {
        ___tracy_emit_zone_end(ctx);
    }
};

#define OdinZoneScopedN(name) CZoneScope odinzone (name, __LINE__, __FILE__, __FUNCTION__, 16)

The thing taking most of the time in your gaps is not the allocations, but the fact that you are requesting a callstack to go with the zone. For example, with a depth of 16, the gaps grow to hundreds of nanoseconds here for me; with a depth of 0, the gap is only about 50ns (and note that in my snippet above, the overhead also includes the strlen() calls).
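
One possible adaptation of the wrapper above (a sketch under the same assumptions, not something taken from the binding) is to pay for the callstack only when a positive depth is requested:

#include <cstdint>   // uint64_t
#include <cstring>   // strlen
#include <path-to-tracy/public/tracy/TracyC.h>

struct CZoneScopeOpt {
    TracyCZoneCtx ctx;
    CZoneScopeOpt(const char* name, int line, const char* filename, const char* function, int depth) {
        const uint64_t srcloc = ___tracy_alloc_srcloc_name(line, filename, strlen(filename),
                                                           function, strlen(function),
                                                           name, strlen(name), 0);
        ctx = (depth > 0)
            ? ___tracy_emit_zone_begin_alloc_callstack(srcloc, depth, 1) // captures a callstack
            : ___tracy_emit_zone_begin_alloc(srcloc, 1);                 // skips callstack capture
    }
    ~CZoneScopeOpt() { ___tracy_emit_zone_end(ctx); }
};

#define OdinZoneScopedNoStack(name) CZoneScopeOpt odinzone(name, __LINE__, __FILE__, __FUNCTION__, 0)

This still keeps the runtime source-location allocation and the strlen() calls, so it won't reach static-srcloc numbers, but it removes the dominant callstack cost.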

slomp · Dec 03 '25

@wolfpld I wonder if there's a clean way to visualize the "overhead" of the callstack capture (or transient allocations) in the zone (we'd be getting 3 timestamps instead of 2, and maybe coloring the space between the first two timestamps in a different color).

slomp · Dec 03 '25

@caioraphael1 Does this solve your issue?

slomp · Dec 04 '25

@wolfpld I think we can close this issue now.

slomp · Dec 05 '25

Hello, I'm stuck with work right now, but I plan to dig deeper into this over the weekend. I haven't had the time to test it yet. I also received some suggestions about adapting the binding to avoid the extra allocations.

caioraphael1 · Dec 05 '25

After disabling the call stack, I got around 80-200ns between zones, sometimes down to 30ns. This confirms that the problem was indeed the call stack (I was using a depth of 2 before), and it could probably be improved further by not using the C API and avoiding the extra allocations. I'll check if it's possible to do something similar to the macro above in Odin, as the language doesn't offer much metaprogramming support :/ Thank you for the ideas and for the answers!

caioraphael1 · Dec 06 '25