Add support for profiling on Windows
This PR adds (initial) support for profiling on Windows. It uses the Event Tracing for Windows framework to get very fast, low-overhead captures.
In order to expedite getting something working, this PR calls out to the `xperf.exe` tool to start/stop tracing and save output to a file. It then parses events from that file, adding them to a profile. This is probably good enough for 95% of use cases, but in the future it would be nice to use ETW directly to create an in-memory trace, handling events as they come in. ETW is just such an annoying API to work with that I didn't want to tackle that; I use the `ferrisetw` crate for event parsing, and it can probably be used for kicking off a kernel trace, but the process for enabling kernel providers is convoluted.
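Concretely, the xperf round trip is just a pair of child-process invocations, roughly like this (a minimal sketch; the kernel flags, stackwalk option, and output path shown are illustrative and not necessarily what this PR passes):

```rust
use std::process::Command;

// Start the NT Kernel Logger with sampled profiling and stack collection.
// (Flag set is illustrative; xperf must be on PATH and the process elevated.)
fn start_xperf_trace() -> std::io::Result<()> {
    let status = Command::new("xperf")
        .args(["-on", "PROC_THREAD+LOADER+PROFILE", "-stackwalk", "Profile"])
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(std::io::ErrorKind::Other, "xperf -on failed"));
    }
    Ok(())
}

// Stop the session and merge the result into an .etl file, which we then parse.
fn stop_xperf_trace(output_etl: &str) -> std::io::Result<()> {
    let status = Command::new("xperf").args(["-d", output_etl]).status()?;
    if !status.success() {
        return Err(std::io::Error::new(std::io::ErrorKind::Other, "xperf -d failed"));
    }
    Ok(())
}
```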
There are a few things that I need to fix:
- Sort out Windows administrator privilege elevation. I need administrator privileges for two things: to run `xperf`, and to query the list & global addresses of kernel drivers. Right now, I require that `samply` be run as administrator. But a better solution would be to spawn an elevated child process that can do all the elevated things, while the main process takes care of launching the process to be profiled (right now it's launched as Administrator, not great!). See the elevation-check sketch after this list.
- Implement profiling by pid. Trivial, I just haven't written the code yet. I'll also add support for just profiling the entire system, basically `samply start` and wait for input to stop profiling. (Or `samply start --time 10s`.)
- Sort out some arch bits; right now there's some hardcoded aarch64, which isn't even correct (should be ARM64).
- Grab thread names. Double-check main-thread rules (I'm not sure if the main thread and the process have the same PID, or if there's a different TID for a process's main thread. I think the latter, but double-check.)
- Actually test on a system where I can use the profile provider :) ARM64 Windows seems to have a bug where the kernel `PROFILE` provider just doesn't work; it errors out when you try to start. And I'm on an Apple Silicon Mac using Parallels, with no x86_64 (or non-VM ARM64) hardware for another week.
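For the "require administrator" check itself, something like this should be enough (a hedged sketch using the `winapi` crate's token APIs; the function name and the choice of `winapi` over `windows` are mine, not necessarily what this PR does):

```rust
// Rough sketch: ask the current process token whether it is elevated.
// Assumes the winapi features "processthreadsapi", "securitybaseapi",
// "handleapi", and "winnt" are enabled.
#[cfg(windows)]
fn running_elevated() -> bool {
    use std::mem;
    use winapi::um::handleapi::CloseHandle;
    use winapi::um::processthreadsapi::{GetCurrentProcess, OpenProcessToken};
    use winapi::um::securitybaseapi::GetTokenInformation;
    use winapi::um::winnt::{TokenElevation, HANDLE, TOKEN_ELEVATION, TOKEN_QUERY};

    unsafe {
        let mut token: HANDLE = std::ptr::null_mut();
        if OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &mut token) == 0 {
            return false;
        }
        let mut elevation: TOKEN_ELEVATION = mem::zeroed();
        let mut returned_len = 0u32;
        let ok = GetTokenInformation(
            token,
            TokenElevation,
            &mut elevation as *mut _ as *mut _,
            mem::size_of::<TOKEN_ELEVATION>() as u32,
            &mut returned_len,
        );
        CloseHandle(token);
        ok != 0 && elevation.TokenIsElevated != 0
    }
}
```

That only answers "are we already elevated?"; spawning the elevated helper is the separate piece described in the first bullet.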
But it works, and works quite well!
Future work:
- Add support for JIT events, especially CoreCLR. Luckily, ETW is pretty pervasive on Windows, and CoreCLR delivers JIT events via ETW, so no need to mess around with jitdump files.
- A bunch of things that will probably involve helping the profiler front-end to not be so Firefox specific so that we can capture additional data, but I'm scared of the front end code :|
I suppose this would also fix #77.
Nice work! We have another implementation at https://github.com/jrmuizel/etw-profiler/tree/main/etw-gecko that we've been using to profile Firefox. I should have added a comment to #77 pointing to our implementation, so I'm sorry if I've caused you to do extra work by not doing so. However, given the state of ETW documentation, I'd say having multiple implementations is probably a good thing.
For Administrator privilege elevation I was planning to use the `runas` crate.
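Roughly like this, I'd guess (a sketch only; the helper binary name and flag are placeholders, and I haven't checked how `runas` interacts with the elevated child's stdio or lifetime):

```rust
// Sketch: re-launch an elevation-requiring helper through UAC via the `runas` crate.
// "samply-elevated-helper" and "--start-etw-session" are placeholders, not real CLI.
fn spawn_elevated_helper() -> std::io::Result<std::process::ExitStatus> {
    runas::Command::new("samply-elevated-helper")
        .arg("--start-etw-session")
        .status()
}
```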
> A bunch of things that will probably involve helping the profiler front-end to not be so Firefox specific so that we can capture additional data, but I'm scared of the front end code :|
If you file issues in the Firefox profiler repo, I can add them to the "General-purpose profiler" list.
> I should have added a comment to https://github.com/mstange/samply/issues/77 pointing to our implementation, so I'm sorry if I've caused you to do extra work by not doing so.
No problem, I really should've gone looking first. This was also a learning exercise, and I need to do CoreCLR ETW events anyway. From a quick glance at that crate, it's focused on Firefox profiling (understandably), and has lots of logic for SpiderMonkey etc. How do you want me to proceed here -- would you accept this PR as a "lightweight" generic Windows implementation into samply, and keep the Firefox etw-profiler/etw-gecko stuff separate as-is?
The original plan was to just dump the etw-gecko + etw-reader code into this repo as-is (plus run cargo fmt and fix the warnings + clippy), and then change it to use more of the Linux importer's infrastructure, e.g. `crate::linux_shared::processes::Processes` and everything that comes attached with that, to get support for the `--reuse-threads` option.
There's not a ton of Firefox-specific stuff in there. There's a little bit for Firefox-specific ETW trace event support. The JIT stuff supports both SpiderMonkey and V8, and also everything else that happens to use `Microsoft-JScript/MethodRuntime` (I don't know if there are any other users). And all the files other than `main.rs` are already part of `samply/src/linux_shared`.
The next step in the original plan was to stop using `xperf` and do the recording ourselves, by starting with https://github.com/jrmuizel/etw-profiler/blob/ce597a3e1d54181a3fd6a9b4a249e15e620643bb/etw-reader/examples/record.rs and adding everything needed to make it work.
Jeff also did a lot of work on the etw-reader side to correctly parse ETW trace events coming out of Windows and out of Chrome. The etw-reader code has diverged quite a bit from the original ferrisetw code, and most of the changes were about fixing various parse errors and about adding support for more (undocumented) event types, as far as I understand.
I need to look at your implementation more closely before I can decide how to proceed. I don't feel strongly about which code to start with (you could say the etw-gecko code isn't the cleanest), as long as we eventually end up with something that isn't worse than either implementation.
Based on a quick glance at your implementation, here are a few things where the etw-gecko implementation is more complete:
- The LibraryInfo data is read from `KernelTraceControl/ImageID`, `KernelTraceControl/ImageID/DbgID_RSDS`, and `MSNT_SystemTrace/Image` events, without reading the binary file. This probably only works because xperf does its own post-processing of the trace during the merging process, and we'll likely have to read the binary file ourselves once we stop using `xperf`.
- I've spent a lot of effort pairing up kernel and user stacks in a way that I think is mostly correct now.
- We correctly compute CPU deltas and off-cpu samples based on CSwitch information (if you run with `-stackwalk profile+cswitch`); the rough idea is sketched below.
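The CSwitch bookkeeping boils down to something like the following (a simplified sketch of the idea rather than the actual etw-gecko code; the type and field names are invented):

```rust
use std::collections::HashMap;

// When a thread is switched out we remember the timestamp; when it is switched
// back in, the gap is the time it spent off-cpu and can be emitted as an
// off-cpu sample. Names here are illustrative only.
#[derive(Default)]
struct OffCpuTracker {
    switched_out_at: HashMap<u32 /* tid */, u64 /* timestamp */>,
}

impl OffCpuTracker {
    /// Handle one CSwitch event; returns (tid, off_cpu_duration) for the
    /// thread that just got scheduled back in, if we saw it leave earlier.
    fn handle_cswitch(&mut self, old_tid: u32, new_tid: u32, timestamp: u64) -> Option<(u32, u64)> {
        self.switched_out_at.insert(old_tid, timestamp);
        self.switched_out_at
            .remove(&new_tid)
            .map(|t_out| (new_tid, timestamp.saturating_sub(t_out)))
    }
}
```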
Yeah, etw-reader/etw-gecko is definitely more complete. I also don't feel strongly about my implementation vs etw-gecko -- ultimately I just want something that works (and that I can roll out internally to everyone; I'm tired of the terrible cross-platform profiling story). My overall implementation is really dead-simple. I haven't even profiled it, and I know there are probably lots of easy perf improvements (e.g. repeated parsing of the same event structure annoys me).
Good to know about `KernelTraceControl` events -- I didn't know about those, and the library ID stuff was annoying (due to the asyncness), so I'll add that. I'll take a look at the CSwitch info as well. For kernel + user stack lining up -- tell me more? I'll take a look at the code, but I assumed that that merging already happened into a single stack.
For `xperf` vs native... is that really worth it? The only advantage that I really see in not using `xperf` is not needing to have it available, and not needing additional disk space for the .etl file. The ability to process events as they come in instead of via an intermediate file seems like a mixed bag. It would mean that the event processing must be as lightweight as possible, because it'll be happening during the trace itself, and it's overall much more code to maintain. (A good reason might be if there were plans to support a "live" profiling display, but I don't think the JSON profiling format can really support that?) Seems like a nice-to-have, but not something that I'd wait on landing Windows support for (either this or via etw-reader).
> For kernel + user stack lining up -- tell me more? I'll take a look at the code, but I assumed that that merging already happened into a single stack.
The kernel and user parts of the stack come in separate stack events. And sometimes you can stay in the kernel for multiple samples, and then once you get back out into user-land, a single user stack is captured which applies to all the previously-taken kernel stacks. At least that's what I think is happening; this assumption has given us the most reasonable looking profiles so far.
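In code, that heuristic amounts to roughly this (a sketch of the pairing as just described, not the actual etw-gecko implementation; all names are invented):

```rust
use std::collections::HashMap;

// Kernel stacks arrive first and are buffered per thread; when the matching
// user stack shows up, it is appended to every kernel stack still waiting.
#[derive(Default)]
struct StackPairer {
    // tid -> buffered (timestamp, kernel frames) waiting for a user stack
    pending_kernel_stacks: HashMap<u32, Vec<(u64, Vec<u64>)>>,
}

impl StackPairer {
    fn on_kernel_stack(&mut self, tid: u32, timestamp: u64, frames: Vec<u64>) {
        self.pending_kernel_stacks
            .entry(tid)
            .or_default()
            .push((timestamp, frames));
    }

    /// Returns the completed (timestamp, kernel+user frames) samples for this thread.
    fn on_user_stack(&mut self, tid: u32, user_frames: &[u64]) -> Vec<(u64, Vec<u64>)> {
        let pending = self.pending_kernel_stacks.remove(&tid).unwrap_or_default();
        pending
            .into_iter()
            .map(|(ts, mut kernel_frames)| {
                kernel_frames.extend_from_slice(user_frames);
                (ts, kernel_frames)
            })
            .collect()
    }
}
```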
> For `xperf` vs native... is that really worth it? The only advantage that I really see in not using `xperf` is not needing to have it available
This is the part that I care about the most. I don't like asking users to install other software first. The other part I care about is reducing the post-processing time.
> The ability to process events as they come in instead of via an intermediate file seems like a mixed bag. It would mean that the event processing must be as lightweight as possible, because it'll be happening during the trace itself
True. Though for lower-overhead profiling, `samply import trace.etl` would still remain an option.
The Linux backend also has this overhead already (and more, because it asks the kernel to copy raw stack bytes around).
> and overall much more code to maintain.
Some, sure.
> Seems like a nice-to-have, but not something that I'd wait on landing Windows support for (either this or via etw-reader).
Oh I agree, the first shipping version should use xperf.
> (A good reason might be if there were plans to support a "live" profiling display, but I don't think the JSON profiling format can really support that?)
No plans at the moment.
I dug into `etw-gecko` a bit more. It's a ton more complete. Some quick thoughts: there are two things I like about my implementation better, though they're all surface/structural stuff:
- I kinda like how I did the event parsing in https://github.com/mstange/samply/pull/141/files#diff-a399eae2dd67b64464441b07522fc74f2ee27f68830cc92cb2019646609b9bd3 more than writing it all out in the big switch statement. Not every event can be handled that way, though; there are some with a lot of logic that I looked into less.
- I did the path song-and-dance to convert NT paths to DOS ones. (It's not 100% complete -- it won't handle network stuff, but I should just do the `GLOBALROOT` stuff for those.) A rough sketch of the drive-mapping part is below.
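The core of that song-and-dance is building a device-prefix to drive-letter map with `QueryDosDeviceW` and rewriting the prefix. A hedged sketch (not the exact code in this PR, and it ignores the network/`GLOBALROOT` cases mentioned above):

```rust
// Sketch: map an NT path like "\Device\HarddiskVolume3\foo\bar.dll" to a DOS
// path like "C:\foo\bar.dll" by asking each drive letter for its device name.
// Uses the winapi crate ("fileapi" feature); error handling is minimal.
#[cfg(windows)]
fn nt_path_to_dos_path(nt_path: &str) -> Option<String> {
    use winapi::um::fileapi::QueryDosDeviceW;

    for letter in b'A'..=b'Z' {
        let drive = format!("{}:", letter as char);
        let drive_w: Vec<u16> = drive.encode_utf16().chain(std::iter::once(0)).collect();
        let mut target = [0u16; 512];
        let len =
            unsafe { QueryDosDeviceW(drive_w.as_ptr(), target.as_mut_ptr(), target.len() as u32) };
        if len == 0 {
            continue; // drive letter not in use
        }
        // The buffer starts with a NUL-terminated device name, e.g. "\Device\HarddiskVolume3".
        let end = target.iter().position(|&c| c == 0)?;
        let device = String::from_utf16_lossy(&target[..end]);
        if let Some(rest) = nt_path.strip_prefix(&device) {
            return Some(format!("{}{}", drive, rest));
        }
    }
    None
}
```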
But it would be pretty quick to turn `etw-gecko` into a library and just plug it into the samply integration code that's here, replacing my parsing. Thoughts? I'm happy to do that work if you or @jrmuizel weren't going to get to it soon. Then we can do some refactoring of the etw-gecko code. (It would also be nice to move it into this repo once that happens, for easier future work, but that depends on whether you two would be up for that.)
> I'm happy to do that work if you or @jrmuizel weren't going to get to it soon. Then we can do some refactoring of the etw-gecko code. (It would also be nice to move it into this repo once that happens, for easier future work, but that depends on whether you two would be up for that.)
That would be great! I would like to have it in this repo. And I think it should just be part of the samply package for now, so that we can easily change it without having to release anything separately to crates.io. I was going to get to it in two weeks or so, but if you get to it before then, please go ahead!
> I did the path song-and-dance to convert NT paths to DOS ones. (It's not 100% complete -- it won't handle network stuff, but I should just do the `GLOBALROOT` stuff for those.)
I was really hoping there'd be a crate that does this, and it seems like `dunce` would be the one that you'd expect to do it, but it doesn't simplify those paths. Here's an issue asking about it.