jmh icon indicating copy to clipboard operation
jmh copied to clipboard

7903722: JMH: Add xctrace-based perfnorm profiler for macOS

Open fzhinkin opened this issue 1 year ago • 9 comments
trafficstars

Implementation of a perfnorm-alike profiler for macOS based on xctrace command line tool bundled with Xcode.

While the profiler is tested and seems to be working well, I consider it rather a preliminary version and open to a discussion on what and how it should measure.

Currently, the profiler only supports PMU counters sampling using CPU Counters instrument provided by the Instruments app / xctrace. Unfortunately, CPU Counters instrument has no default settings, unlike Time Profiler and CPU Profiler instruments used by the recently merged xctraceasm profiler. To use CPU Counters, a user has to create a template in the Instruments UI, select PMU events, save the template and then supply to xctracenorm as an argument.

This workflow not only prevents use of the profiler without preliminary manual configuration, but also tends to be annoying when it comes to measuring multiple events, as xctrace, unlike perf_events, does not support events multiplexing.

Thankfully, command-line-based configuration and default parameters could be emulated by building a custom Instruments package that imports data from CPU Counters and also supplies all required parameters. As you can guess, there's no way to get information about supported PMU events directly from xctrace, but it could be fetched from KPEP database files, stored is /usr/share/kpep. xctracenorm relies on that data to validate events specified by a user, if any, and also to print a help message that gives some insights into what could be sampled.

To sum up, there are a few things that were implemented to make xctracenorm profiler works:

  • CPU model deletion using sysctl;
  • KPEP file parsing to extract information about the PMU and all supported events;
  • selected performance events validation;
  • Instruments package building (generate XML, call a builder tool), packages are cached in ~/Library/Caches/org.openjdk.jmh;
  • xctrace execution, resulting samples extraction, and aggregation;
  • samples postprocessing to calculate some additional metrics, like CPI and branch missprediction ratio.

Currently, if a user didn't specify any additional options, xctracenorm will sample instructions, cycles, branches and mispredicted branches events. These were selected as events that should be supported in all hardware macOS runs on; only 4 events were selected for the same reason.

Profiling results look like this on M2-based MacBook:

java -jar ./benchmarks.jar -prof xctracenorm -f 1 JMHSample_35_Profilers.Atomic
...
Benchmark                                                                 Mode  Cnt   Score   Error                               Units
JMHSample_35_Profilers.Atomic.test                                        avgt    5   4.055 ± 0.185                               ns/op
JMHSample_35_Profilers.Atomic.test:BRANCH_MISPRED_NONSPEC                 avgt       ≈ 10⁻⁵                                        #/op
JMHSample_35_Profilers.Atomic.test:Branch miss ratio                      avgt       ≈ 10⁻⁶          BRANCH_MISPRED_NONSPEC/INST_BRANCH
JMHSample_35_Profilers.Atomic.test:CORE_ACTIVE_CYCLE                      avgt       10.541                                        #/op
JMHSample_35_Profilers.Atomic.test:CPI                                    avgt        0.351                  CORE_ACTIVE_CYCLE/INST_ALL
JMHSample_35_Profilers.Atomic.test:INST_ALL                               avgt       30.031                                        #/op
JMHSample_35_Profilers.Atomic.test:INST_BRANCH                            avgt        3.850                                        #/op
JMHSample_35_Profilers.Atomic.test:INST_BRANCH density (of instructions)  avgt        0.128                        INST_BRANCH/INST_ALL
JMHSample_35_Profilers.Atomic.test:IPC                                    avgt        2.849                  INST_ALL/CORE_ACTIVE_CYCLE

Here are some alternatives to existing implementation (of default parameters mode, mostly):

  • Create an Instruments template and bundle it with JMH instead of generating a package dynamically: templates have a proprietary and obscure format, it's unclear if a template created on one device will work on other devices, or whether it continues working well after Xcode update. Also, dynamic package generation facilitates PMU events selection in CLI when running benchmarks, which seems to be much more convenient that setting up templates in UI.
  • Don't parse KPEP files: data from these files used for validation, and also allows to gather info about what could be sampled without going to a separate tool. Validation is necessary as xctrace simply crashes when something is wrong with selected events.

Open questions and things that require some fixes:

  • [ ] Currently, only PMU sampling was supported, however, profiler may also sample some software events like context switches, virtual memory operations and syscall statistics. It's unclear if all that should ever be supported.
  • [ ] Some parts of the profiling process, namely KPEP parsing and Instruments package building are covered by tests in jmh-core-it module (XCTraceSupportTest). Tests functions are package-private in jmh-core, so to test them on it module I had to call everything through reflection, which definitely doesn't look good. I'm not sure how to both keep the API private and test in another module, so I'm looking forward to any advice on that.
  • [ ] CPU Counters doesn't work inside VMs (at least, because there are no kpep-files for VM CPU ids, so xctrace can't load those files to fetch info about PMU events; and even if a KPEP file is there, there's no way to sample PMU counters inside VM), so all positive profiler's tests are currently skipped inside GH agents. I'm not sure what could be done here to improve testability.
  • [x] When running tests locally, surefire forks used in jmh-core-it cause interference between xctraceasm and xctracenorm tests leading to test failures. I'm currently looking into how to overcome the issue and serialize these tests' execution. In the worst case, tests for both profilers could be placed in a single test class, I guess.

Speaking of testing, I manually ran basic scenarios (listing profilers, printing a help message, listing supported events, and, finally, profiling) on Intel-, M1- and M2-based MacBooks. Results and a script I used to collect data could be found here: https://gist.github.com/fzhinkin/f7c5db00f5e3417191d66994ed880818


Progress

  • [x] Change must not contain extraneous whitespace
  • [ ] Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • CODETOOLS-7903722: JMH: Add xctrace-based perfnorm profiler for macOS (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jmh.git pull/131/head:pull/131
$ git checkout pull/131

Update a local copy of the PR:
$ git checkout pull/131
$ git pull https://git.openjdk.org/jmh.git pull/131/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 131

View PR using the GUI difftool:
$ git pr show -t 131

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jmh/pull/131.diff

Webrev

Link to Webrev Comment

fzhinkin avatar May 04 '24 19:05 fzhinkin

:wave: Welcome back fzhinkin! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar May 04 '24 19:05 bridgekeeper[bot]

@fzhinkin This change now passes all automated pre-integration checks.

After integration, the commit message for the final commit will be:

7903722: JMH: Add xctrace-based perfnorm profiler for macOS

Reviewed-by: shade

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@shipilev) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

openjdk[bot] avatar May 04 '24 19:05 openjdk[bot]

Webrevs

mlbridge[bot] avatar May 07 '24 06:05 mlbridge[bot]

Here are some alternatives to existing implementation (of default parameters mode, mostly): ...

  • Don't parse KPEP files: data from these files used for validation, and also allows to gather info about what could be sampled without going to a separate tool. Validation is necessary as xctrace simply crashes when something is wrong with selected events.

On the other hand, KPEP parsing and PMU events validation logic is a fair share of the whole PR, thus removing it should make the implementation much simpler.

As a middle ground, the validation could be reduced to checking event names only (and that could be done without parsing KPEP XML file, by running plutil -extract system.cpu.events raw -o - <path.to.plist>).

@shipilev please let me know which option (if any at all) suits JMH better.

fzhinkin avatar May 25 '24 08:05 fzhinkin

@fzhinkin This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Jun 22 '24 14:06 bridgekeeper[bot]

Gentle ping to preserve the PR from being closed.

fzhinkin avatar Jun 24 '24 15:06 fzhinkin

@fzhinkin This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Jul 22 '24 21:07 bridgekeeper[bot]

Great work @fzhinkin! I've been trying this out and thought I'd share my experience:

  • xctraceasm seems to be working well, but some characters do not appear to be printed well:
   0.03%  ??? ??? ?   0x00000001168cf5d8:   sbfiz		x10, x6, #2, #0x20  ;*iaload {reexecute=0 rethrow=0 return_oop=0}

The ? characters should instead be the vertical lines that bind together loops.

  • xctracenorm not working for me, it stops with error:
Can't create an Instruments package:

Finished running test ...

I do have instrumentbuilder locally but any calls to it get killed:

$ ls -l /Applications/Xcode.app/Contents/Developer/usr/bin/instrumentbuilder
-rwxr-xr-x  1 galder  staff   184K  6 Oct  2023 /Applications/Xcode.app/Contents/Developer/usr/bin/instrumentbuilder

$ /Applications/Xcode.app/Contents/Developer/usr/bin/instrumentbuilder -h
[1]    98880 killed     /Applications/Xcode.app/Contents/Developer/usr/bin/instrumentbuilder -h

I'm on Sonoma 14.5.

galderz avatar Jul 25 '24 04:07 galderz

The instrumentbuilder error is gone with an XCode upgrade.

galderz avatar Jul 25 '24 12:07 galderz

@galderz thanks for trying profilers and sorry for the late reply!

I'll check what may cause issues with instrumentbuilder and try to improve error reporting.

Is the problem with ? being printed instead of | is still reproduces? Could you please share the details on how to reproduce it? For me, everything is printed correctly given a some random benchmark, zsh w/ default locale settings and macOs 14.4.

fzhinkin avatar Aug 12 '24 21:08 fzhinkin

The issue with instrumentbuilder was due to an old xcode version. Updated xcode and then it worked fine.

galderz avatar Aug 15 '24 13:08 galderz

@fzhinkin This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Sep 12 '24 14:09 bridgekeeper[bot]

@shipilev hey! Does it make any sense to keep this PR open (and continuing working on it)? I guess, the immense amount of changes blocks this PR from being reviewed in a reasonable time frame. I'd love to throw away as much code as possible (see https://github.com/openjdk/jmh/pull/131#issuecomment-2131129163; also Summary's "Here are some alternatives to existing implementation (of default parameters mode, mostly)" portion), but to keep the profiler as user-friendly as, for instance, perfnorm, some portion of logic gluing all parts together have to be preserved.

Please let me know what would work better (if anything).

fzhinkin avatar Sep 16 '24 15:09 fzhinkin

Hi @fzhinkin! Sorry, the delays are my fault, I keep getting distracted to other things, and this is a huge PR.

I think we should indeed consider trimming down things that we do here. This is an impressive piece of work, but I cannot help but think how clunky this MacOS profiling interface really is. Building a package to configure a profiler, wow, what a hassle.

If we commit to do an extended thing here, it would be harder to retract. If we start small, maybe we do not really need to extend it any in future. So I think we should be doing a favor to ourselves here, and only implement a very basic support: no building packages, no selectable events, a preconfigured template (which we might later conditionalize on xcode version?) with most useful events, an ability to substitute your own template (for power users). Yes, it would not be as great as perfnorm, but that's the tradeoff we make not to deal with the profiling interface clunkiness.

Does that work?

To hopefully answer the specific questions:

Currently, only PMU sampling was supported, however, profiler may also sample some software events like context switches, virtual memory operations and syscall statistics. It's unclear if all that should ever be supported.

No need to make it even more complicated. Scope it out of this PR.

Some parts of the profiling process, [...] Tests functions are package-private in jmh-core, so to test them on it module I had to call everything through reflection, which definitely doesn't look good. I'm not sure how to both keep the API private and test in another module, so I'm looking forward to any advice on that.

Reflection is fine for now. If this becomes a real burden, we can rethink how tests are factored.

CPU Counters doesn't work inside VMs [...], so all positive profiler's tests are currently skipped inside GH agents. I'm not sure what could be done here to improve testability.

Skipping tests in GHA if GHA runners cannot support them is fine.

shipilev avatar Sep 20 '24 18:09 shipilev

@shipilev, thank you for taking a look and addressing the questions!

So I think we should be doing a favor to ourselves here, and only implement a very basic support: no building packages, no selectable events, a preconfigured template (which we might later conditionalize on xcode version?) with most useful events, an ability to substitute your own template (for power users). Yes, it would not be as great as perfnorm, but that's the tradeoff we make not to deal with the profiling interface clunkiness.

Does that work?

Absolutely! I'll come back by with a trimmed down version soon.

fzhinkin avatar Sep 20 '24 19:09 fzhinkin

@fzhinkin Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

openjdk[bot] avatar Sep 21 '24 20:09 openjdk[bot]

I updated the profiler so it no longer generates Instruments packages. Instead, users need to specify Instruments template name or path. If template was not specified, a default template will be used instead. It samples four PMU counters which, through aliased names, should be available on all supported CPUs.

I tested the profiler with Instruments versions from 13.0 to 16.0. The default template works fine with any of them, on both x86-64 and aarch64 devices.

fzhinkin avatar Sep 21 '24 21:09 fzhinkin

@shipilev

I tried on my Mac, and I think profiler requires super-user privileges? I only succeeded when running sudo java ....

It should not require super-user privileges. Otherwise, xctraceasm will require them too. Could you please elaborate on what happened w/o SU-privileges?

fzhinkin avatar Sep 23 '24 16:09 fzhinkin

The large CORE_ACTIVE_CYCLE/INST_ALL are not great: too long? See how Linux perfnorm reports them, e.g. clks/insns.

I'll use aliases similar to what perf_events provides, then.

fzhinkin avatar Sep 23 '24 16:09 fzhinkin

I tried on my Mac, and I think profiler requires super-user privileges? I only succeeded when running sudo java ....

It should not require super-user privileges. Otherwise, xctraceasm will require them too. Could you please elaborate on what happened w/o SU-privileges?

Yeah, xctraceasm does not work either. This might be a borked setup on my side, really.

$ java -jar jmh-samples/target/benchmarks.jar -prof xctracenorm -f 1 JMHSample_35_Profilers.Atomic -v EXTRA
...

# Fork: 1 of 1
# Preparing profilers: XCTraceNormProfiler 
[-[SimServiceContext sendRequest:]:1764] ERROR : Unable to deliver request ({
    "developer_dir" = "/Applications/Xcode.app/Contents/Developer";
    request = "set_developer_dir";
}) because we are not connected to CoreSimulatorService.
[-[SimServiceContext initWithDeveloperDir:connectionType:error:]:450] WARN  : Unable to discover any Simulator runtimes. Developer Directory is /Applications/Xcode.app/Contents/Developer.
[-[SimDiskImageManager init]:146] ERROR : Could not kickstart simdiskimaged; SimDiskImageManager services will not be available: Error Domain=NSPOSIXErrorDomain Code=53 "Software caused connection abort" UserInfo={NSLocalizedDescription=Error returned in reply from CoreSimulatorService: Connection invalid}
[-[SimServiceContext sendRequest:reply:error:]:1784] ERROR : Unable to deliver request ({
    request = "notification_subscription";
    "set_path" = "/Users/shipilev/Library/Developer/CoreSimulator/Devices";
}) because we are not connected to CoreSimulatorService.
Starting recording with the jmh14925377671874505903extracted template. Launching process: java. 
xpc connection invalid: com.apple.dt.instruments.dtsecurity.xpc
Ctrl-C to stop the recording
Run issues were detected (trace is still ready to be viewed):
* [Error] Unexpected failure: Couriers have returned unexpectedly.

* [Error] Failed to acquire kernel trace recording resources (Connection refused). Possibly in use by pid 53875.

* [Error] UnexRepceocrtdeidn gf afialiulreed:  wDiattha  esroruorrcse.  aSgaevnitn gf aoiultepdu tt of ialrem...

.
Output file saved as: Launch_java_2024-09-23_18.59.50_6B677147.trace
<forked VM failed with exit code 2>

...and pid 53875 is tailspind:

    0 53875     1   0  5:45PM ??         0:00.02 /usr/libexec/tailspind

shipilev avatar Sep 23 '24 17:09 shipilev

Abbreviated units, not it looks like:

Benchmark                                                          Mode  Cnt   Score   Error      Units
JMHSample_35_Profilers.Atomic.test                                 avgt    5   1.885 ± 0.006      ns/op
JMHSample_35_Profilers.Atomic.test:BRANCH_MISPRED_NONSPEC          avgt       ≈ 10⁻⁵               #/op
JMHSample_35_Profilers.Atomic.test:Branch miss ratio               avgt       ≈ 10⁻⁵           miss/brs
JMHSample_35_Profilers.Atomic.test:CORE_ACTIVE_CYCLE               avgt        6.988               #/op
JMHSample_35_Profilers.Atomic.test:CPI                             avgt        0.584          clks/insn
JMHSample_35_Profilers.Atomic.test:INST_ALL                        avgt       11.970               #/op
JMHSample_35_Profilers.Atomic.test:INST_BRANCH                     avgt        1.995               #/op
JMHSample_35_Profilers.Atomic.test:INST_BRANCH density (of insns)  avgt        0.167             #/insn
JMHSample_35_Profilers.Atomic.test:IPC                             avgt        1.713          insns/clk

fzhinkin avatar Sep 23 '24 22:09 fzhinkin

Yeah, xctraceasm does not work either. This might be a borked setup on my side, really.

I had a similar issue when a custom kernel trace client started w/ SU-privileges and died without releasing ktrace. Long story short, rebooting the device helped (~~xctrace start running, but my system-programming abilities didn't improve~~).

Messages about device simulator looks suspicious, maybe xctrace list devices will shed some light (perhaps, there are no devices available, due to ktrace being locked by another process, so xctrace decided to run on an emulator?).

fzhinkin avatar Sep 23 '24 22:09 fzhinkin

Thank you, @shipilev!

fzhinkin avatar Sep 24 '24 23:09 fzhinkin

/integrate

fzhinkin avatar Sep 24 '24 23:09 fzhinkin

@fzhinkin Your change (at version 7e5602d942691ea52c3a460364e1af8eacf7c71d) is now ready to be sponsored by a Committer.

openjdk[bot] avatar Sep 24 '24 23:09 openjdk[bot]

/sponsor

shipilev avatar Sep 25 '24 06:09 shipilev

Going to push as commit c2931c90fd4079271b69a8a393dfe277634f3e52.

openjdk[bot] avatar Sep 25 '24 06:09 openjdk[bot]

@shipilev @fzhinkin Pushed as commit c2931c90fd4079271b69a8a393dfe277634f3e52.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk[bot] avatar Sep 25 '24 06:09 openjdk[bot]