tracy icon indicating copy to clipboard operation
tracy copied to clipboard

Refactoring ETW kernel session

Open slomp opened this issue 3 weeks ago • 9 comments

Switching from the shared (singleton) "NT Kernel Logger" over to private Kernel logging session (Windows 8+). VSync events have also been unified under the same private session and event record callback.

I'll be adding CPU frequency/throttling events in the future.

slomp avatar Dec 02 '25 00:12 slomp

As part of this refactoring, I am also inclined to move the following event definitions: https://github.com/wolfpld/tracy/blob/886a0abec9f5267f2bf145eb698bac80df80b27a/public/client/TracySysTrace.cpp#L83-L141 into the newly introduced TracyETW.cpp file. Any objections?

slomp avatar Dec 04 '25 19:12 slomp

These structs are used in EventRecordCallback. How will this work now?

wolfpld avatar Dec 04 '25 21:12 wolfpld

These structs are used in EventRecordCallback. How will this work now?

The EventRecordCallback would just reference them through the tracy::etw namespace (e.g. etw::CSwitchEvent). I am also thinking about defining the "magic" opcode numbers as static constexpr members of these structs:

if( hdr.EventDescriptor.Opcode == etw::CSwitchEvent::Opcode )
{
    const auto cswitch = (const etw::CSwitchEvent*)record->UserData;
    ...
}

slomp avatar Dec 04 '25 22:12 slomp

As part of this refactoring, I am also inclined to move the following event definitions:

https://github.com/wolfpld/tracy/blob/886a0abec9f5267f2bf145eb698bac80df80b27a/public/client/TracySysTrace.cpp#L83-L141

into the newly introduced TracyETW.cpp file. Any objections?

None for me, centralizing ETW stuff might even be useful in the future if we ever want to reuse those to allow importing ETL captures

Lectem avatar Dec 05 '25 12:12 Lectem

I moved the events to TracyETW.cpp and added some comments and some alignment/size checks: https://github.com/wolfpld/tracy/pull/1213/commits/7b5de9e2b9ae74ff1efd3d0453a4aee837dcade8

We could add some runtime checks during system trace initialization to query the events (and respective fields) we care about via the TDH API, but that's going to be a bit annoying to do.

slomp avatar Dec 05 '25 20:12 slomp

(don't merge it yet; there seems to be a quirk with VSync I need to investigate)

slomp avatar Dec 05 '25 20:12 slomp

Actually, never mind... I thought there was something funny going on with VSync event setup, but it turns out VSync events are just funky. Sometimes, out of nowhere, I stop receiving them, and that's the case with the current master branch as well. After a while, I start receiving them again.

slomp avatar Dec 06 '25 03:12 slomp

@Lectem About the VSync issue is associated with "best-effort" guarantees that ETW makes regarding high volume events. ETW will happily let you enable a provider in several sessions, and will deliver events the best it can, but for high volume events (and apparently, VSync is considered high-volume enough), it won't guarantee that such events will be logged to all sessions where the provider is enabled.

This also explains why I was getting DxgKrnl events, just not the VSync ones. I found out I still had a "Tracy VSync" and an "NT Kernel Logger" that enabled DxgKrnl provider, and after stopping them, I started to get VSync events. It also explains why I would also not receive VSync events occasionally rom the "Tracy VSync" in master (another session I was experimenting with was still running).

slomp avatar Dec 08 '25 16:12 slomp

@Lectem About the VSync issue is associated with "best-effort" guarantees that ETW makes regarding high volume events. ETW will happily let you enable a provider in several sessions, and will deliver events the best it can, but for high volume events (and apparently, VSync is considered high-volume enough), it won't guarantee that such events will be logged to all sessions where the provider is enabled.

This also explains why I was getting DxgKrnl events, just not the VSync ones. I found out I still had a "Tracy VSync" and an "NT Kernel Logger" that enabled DxgKrnl provider, and after stopping them, I started to get VSync events. It also explains why I would also not receive VSync events occasionally rom the "Tracy VSync" in master (another session I was experimenting with was still running).

For the future, we need to figure out a way to query whether a (high-volume) provider is enabled in another concurrent session or not. If you have ideas on how to do that, let me know.

slomp avatar Dec 08 '25 16:12 slomp

For the future, we need to figure out a way to query whether a (high-volume) provider is enabled in another concurrent session or not. If you have ideas on how to do that, let me know.

It looks like what we need is EnumerateTraceGuidsEx() with the TraceGuidQueryInfo argument. We'll get back a TRACE_PROVIDER_INSTANCE_INFO struct with an EnableCount field, and an array (of that size) of TRACE_ENABLE_INFO structs with information about each enabled instance, including the MatchAny|AllKeyword filters.

slomp avatar Dec 09 '25 20:12 slomp

It looks like what we need is EnumerateTraceGuidsEx() with the TraceGuidQueryInfo argument. We'll get back a TRACE_PROVIDER_INSTANCE_INFO struct with an EnableCount field, and an array (of that size) of TRACE_ENABLE_INFO structs with information about each enabled instance, including the MatchAny|AllKeyword filters.

Here's what I got from my system:

DxgKrnl GUID: 802EC45A-1E99-4B83-9920-87C98277BA9D
EnableCount: 3
    LoggerId: 10 | Name: EventLog-System  | Level: 255 | Any: 0x0C00000000000000 | All: 0x0000000000000000
    LoggerId: 7  | Name: DiagLog          | Level: 255 | Any: 0x0000000002030000 | All: 0x0000000000000000
    LoggerId: 44 | Name: FrameViewService | Level: 4   | Any: 0x0000000008000001 | All: 0x0000000000000001

The first two are from Windows itself, but they don't collect VSync events (judging by the Match keyword flags). The last one, FrameViewService, is from NVIDIA, and is monitoring VSync events.

Experimentally, it looks like only two sessions at most will end up receiving VSync events.

slomp avatar Dec 10 '25 05:12 slomp

Interesting, so if you use both PresentMon and Tracy at the same time for example, things will be dropped?

Lectem avatar Dec 10 '25 06:12 Lectem

Interesting, so if you use both PresentMon and Tracy at the same time for example, things will be dropped?

That's the thing: maybe. I am not sure how these limits apply. For example, I was able to start 6 private kernel sessions concurrently and receive VSync events in all of them (I actually ran out of kernel sessions, the system-wide limit seems to be 8).

But other times, I simply was not getting any VSync event, yet was getting all other DxgKrnl events. Seems pretty random and arbitrary, and I don't think there's an easy (or any) way to control or circumvent it. I think this is rare enough that it should not matter. It could also be tied to the mixture of other high-volume event providers that the concurrent sessions are collecting.

slomp avatar Dec 10 '25 18:12 slomp

I added some code to detect (and log) whether a provider has been enabled by another session elsewhere.

I think this concludes all the review changes and quirks I encountered.

slomp avatar Dec 10 '25 21:12 slomp