
8324751: C2 SuperWord: Aliasing Analysis runtime check

Open eme64 opened this issue 8 months ago • 9 comments

This is a big patch, but about 3.5k lines are tests, and a large part of the VM changes consists of comments and proofs.

I am adding a dynamic (runtime) aliasing check to the auto-vectorizer (SuperWord). We use the infrastructure from https://github.com/openjdk/jdk/pull/22016:

  • Use the auto-vectorization predicate when available: we speculate that there is no aliasing; if the speculation fails we trap and recompile without the predicate.
  • If the predicate is not available, we use multiversioning, i.e. we have a fast_loop without aliasing, and hence with vectorization, and a slow_loop without vectorization for when the check fails.
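A minimal Java-level sketch of what multiversioning conceptually produces for a simple copy loop (illustrative only: the real check is emitted on the compiler's internal pointer expressions, and the class/method names here are made up):

```java
// Illustrative sketch of multiversioning. The compiler emits the aliasing
// check on its internal pointer expressions; this Java-level version only
// shows the shape of the transformation (and ignores int overflow in the
// range check for simplicity).
public class MultiversionSketch {
    public static void copy(byte[] a, byte[] b, int aOff, int bOff, int size) {
        if (a != b || aOff + size <= bOff || bOff + size <= aOff) {
            // fast_loop: the accessed ranges cannot overlap -> vectorizable
            for (int i = 0; i < size; i++) {
                b[i + bOff] = a[i + aOff];
            }
        } else {
            // slow_loop: possible aliasing -> keep scalar execution
            for (int i = 0; i < size; i++) {
                b[i + bOff] = a[i + aOff];
            }
        }
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4};
        byte[] b = new byte[4];
        copy(a, b, 0, 0, 4);    // distinct arrays: fast branch
        copy(a, a, 0, 1, 3);    // overlapping ranges: slow branch
    }
}
```

With non-overlapping ranges the fast branch is taken and can be vectorized; with `a == b` and overlapping ranges the slow branch preserves the scalar, order-dependent semantics.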

Where to start reviewing

  • src/hotspot/share/opto/mempointer.hpp:

    • Read the class comment for MemPointerRawSummand.
    • Familiarize yourself with the MemPointer Linearity Corollary. We need it for the proofs of the aliasing runtime checks.
  • src/hotspot/share/opto/vectorization.cpp:

    • Read the explanations and proofs above VPointer::can_make_speculative_aliasing_check_with. They explain how the aliasing runtime check works.
  • src/hotspot/share/opto/vtransform.hpp:

    • Understand the difference between weak and strong edges.

If you need to see some examples, then look at the tests:

  • test/hotspot/jtreg/compiler/loopopts/superword/TestAliasing.java: simple array cases. IR rules check for vectorization and, in some cases, whether we used multiversioning.
  • test/micro/org/openjdk/bench/vm/compiler/VectorAliasing.java: the micro-benchmarks I show below. Simple array cases.
  • test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentAliasing.java: a bit advanced, but similar cases.
  • test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java: very large and rather complex. Generates random loops, some with and some without aliasing at runtime. IR verification, but currently mostly only for array cases; the MemorySegment cases have some issues (see comments).

Details

Most fundamentally:

  • I had to refactor / extend MemPointer so that we have access to MemPointerRawSummands.
  • These raw summands allow us to reconstruct the VPointer at any iv value with VPointer::make_pointer_expression(Node* iv_value).
    • With the raw summands, a pointer may look like this: p = base + ConvI2L(x + 2) + ConvI2L(y + 2)
    • With "regular" summands, this gets simplified to p = base + 4L + ConvI2L(x) + ConvI2L(y)
    • For aliasing analysis (adjacency and overlap), the "regular" summands are sufficient. But for reconstructing the pointer expression, this could lead to overflow issues.
  • We need to evaluate the pointer expression at init to create the check in VPointer::make_speculative_aliasing_check_with.
  • I wrote up a MemPointer Linearity Corollary that I need for the guarantees in the runtime checks.
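The overflow issue can be reproduced in plain Java: widening to long after the int addition (the shape of a raw summand like ConvI2L(x + 2)) and widening before the addition (the shape of the simplified form) disagree once the int arithmetic wraps. A small demo, with a hypothetical class name:

```java
// Demonstrates why reconstructing a pointer expression from simplified
// summands can be unsound: (long)(x + 2) wraps in int arithmetic before
// widening, while 2L + (long)x widens first, so the two values disagree
// near Integer.MAX_VALUE.
public class SummandOverflow {
    public static void main(String[] args) {
        int x = Integer.MAX_VALUE - 1;       // 2147483646
        long raw        = (long) (x + 2);    // int wraps: -2147483648
        long simplified = 2L + (long) x;     // widens first: 2147483648
        System.out.println(raw + " vs " + simplified);
    }
}
```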

I also had to enhance the VLoopDependencyGraph:

  • We define weak and strong memory edges: strong edges cannot be removed; weak edges can be removed, allowing the operations to be reordered, but reordering requires a runtime check.
  • MemPointer::always_overlaps_with: allows us to check if a memory edge is always strong, because the accesses always alias (= overlap).
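For intuition, here is a hypothetical loop (my example, not from the patch) whose store/load pair forms a weak edge:

```java
// Illustrative: the store to b[i] and the load from a[i + 1] form a weak
// memory edge: whether they alias depends on runtime values (is a == b?),
// so they may only be reordered for vectorization under a runtime check.
// If both accesses hit the same array at the same index, the accesses would
// always overlap, giving a strong edge that can never be removed.
public class WeakEdgeDemo {
    static void shift(int[] a, int[] b, int n) {
        for (int i = 0; i < n - 1; i++) {
            b[i] = a[i + 1]; // weak edge if a and b might be the same array
        }
    }

    public static void main(String[] args) {
        int[] x = {1, 2, 3, 4};
        shift(x, x, 4); // a == b: the weak edge actually aliases at runtime
        System.out.println(java.util.Arrays.toString(x)); // [2, 3, 4, 4]
    }
}
```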

Further:

  • I added flags UseAutoVectorizationPredicate and UseAutoVectorizationSpeculativeAliasingChecks.

Benchmark

(image: benchmark results)

Labels / Columns:

  • no_check = -XX:-UseAutoVectorizationSpeculativeAliasingChecks - like before this patch.
  • normal = -XX:+UseSuperWord
  • no_slow_opt = -XX:-LoopMultiversioningOptimizeSlowLoop - to prove that we need to optimize the slow loop, for the case where the dynamic check fails.
  • no_sw = -XX:-UseSuperWord - No vectorization, also has different unrolling.
  • not_profitable = -XX:AutoVectorizationOverrideProfitability=0 - No vectorization, but keep unrolling the same. Can lead to severe performance regressions especially for byte cases. We have seen similar issues before, e.g. https://github.com/openjdk/jdk/pull/25387 for byte, char and short cases in reduction loops.

Discussion:

  • ?_sameIndex_alias and ?_sameIndex_noalias: since we have sameIndex, we can already prove that we can vectorize without checks. We already vectorized these before this patch.
  • ?_differentIndex_noalias, ?_half, ?_partial_overlap: only vectorizes with dynamic aliasing check.
  • ?_differentIndex_alias: cannot use vectorized loop. We now use the slow_loop, and if it is not optimized (unrolled), we get a heavy slowdown (0.35).

Regular performance testing: no significant change, except for some possible improvements in Crypto-SecureRandomBench_nextBytes. A quick investigation showed that it had at least one loop where the load and the store have different invariants, which requires aliasing analysis runtime checks to prove that the load and store do not alias.

(image: performance testing results)


Follow-up Work

A ResourceMark could not be added in VTransform::apply_speculative_aliasing_runtime_checks; it would require that _idom and _dom_depth in PhaseIdealLoop::set_idom are not ResourceArea allocated. Related issue:


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8324751: C2 SuperWord: Aliasing Analysis runtime check (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24278/head:pull/24278
$ git checkout pull/24278

Update a local copy of the PR:
$ git checkout pull/24278
$ git pull https://git.openjdk.org/jdk.git pull/24278/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 24278

View PR using the GUI difftool:
$ git pr show -t 24278

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24278.diff

Using Webrev

Link to Webrev Comment

eme64 avatar Mar 27 '25 13:03 eme64

:wave: Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Mar 27 '25 13:03 bridgekeeper[bot]

@eme64 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8324751: C2 SuperWord: Aliasing Analysis runtime check

Reviewed-by: kvn, mhaessig

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 45 new commits pushed to the master branch:

  • bd4c0f4a7da9122527dd25df74797c42deaced3c: 8358618: UnsupportedOperationException constructors javadoc is not clear
  • f1c0b4ed722bf4cc5f262e804cec26d59ceb6e8b: 8361495: (fc) Async close of streams connected to uninterruptible FileChannel doesn't throw AsynchronousCloseException in all cases
  • b43c2c663567e59f8b5c84b1b45536078190605b: 8366225: Linux Alpine (fast)debug build fails after JDK-8365909
  • ... and 42 more: https://git.openjdk.org/jdk/compare/45726a1f8b8f76586037867a32b82f8ab9b96937...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar Mar 27 '25 13:03 openjdk[bot]

@eme64 The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Mar 27 '25 13:03 openjdk[bot]

@eme64 this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8324751-Aliasing-Analysis-RTC
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

openjdk[bot] avatar Apr 21 '25 11:04 openjdk[bot]

@eme64 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar May 22 '25 18:05 bridgekeeper[bot]

@eme64 This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.

bridgekeeper[bot] avatar Jun 26 '25 13:06 bridgekeeper[bot]

/open

eme64 avatar Jun 26 '25 13:06 eme64

@eme64 This pull request is now open

openjdk[bot] avatar Jun 26 '25 13:06 openjdk[bot]

@eme64 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper[bot] avatar Jul 24 '25 19:07 bridgekeeper[bot]

/touch

Still hoping for reviewers :)

eme64 avatar Jul 25 '25 17:07 eme64

@eme64 The pull request is being re-evaluated and the inactivity timeout has been reset.

openjdk[bot] avatar Jul 25 '25 17:07 openjdk[bot]

@mhaessig Thanks for reviewing! I fixed the merge conflict, and addressed all your comments :)

eme64 avatar Aug 03 '25 08:08 eme64

⚠️ @eme64 This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

openjdk[bot] avatar Aug 03 '25 08:08 openjdk[bot]

@mhaessig Thanks for the responses. I integrated the one suggestion now, I think it is ready for another round :blush:

eme64 avatar Aug 11 '25 00:08 eme64

@mhaessig Thanks for the detailed review! I think I responded to all your suggestions/comments.

eme64 avatar Aug 12 '25 17:08 eme64

@chhagedorn Thanks for the drive-by comments about the Predicate documentation. Are you now satisfied? Maybe @rwestrel should have a look at it too, since I completed the missing documentation from the Short-Running-Long-Loop-Predicates as well.

eme64 avatar Aug 14 '25 14:08 eme64

@eme64 did you measure how much C2 compilation time changed with these changes (all optimizations enabled)?

vnkozlov avatar Aug 17 '25 22:08 vnkozlov

@eme64 did you measure how much C2 compilation time changed with these changes (all optimizations enabled)?

I did not. I don't think it would take much extra time in almost all cases. The extra analysis is not that costly compared to the unrolling that we do in all cases already. What might cost more: if we deopt because of the runtime check and recompile with multiversioning, that could essentially double the C2 compile time for those cases.

Do you think it is worth it to benchmark now, or should we just rely on @robcasloz 's occasional benchmarking and address the issues if they come up?

If you want me to do C2 time benchmarking: should I just show a few specific micro-benchmarks, or do you want to have statistics collected on larger benchmark suites?

eme64 avatar Aug 18 '25 07:08 eme64

Do you think it is worth it to benchmark now, or should we just rely on @robcasloz 's occasional benchmarking and address the issues if they come up?

I am fine with using Roberto's benchmarking later. Just keep an eye on it.

vnkozlov avatar Aug 18 '25 14:08 vnkozlov

@vnkozlov I ran some more benchmarks:

(image: benchmark results)

Columns:

  • not_profitable - -XX:AutoVectorizationOverrideProfitability=0. Serves as baseline scalar performance. Unrolling is the same as if we vectorized.
  • no_sw - -XX:-UseSuperWord. Can mess with the unrolling factor, and thus gets worse performance.
  • patch - no flags. Overall best performance - except for bench_copy_array_B_differentIndex_alias and bench_copy_array_I_differentIndex_alias - need to investigate :warning:
  • no_predicate - -XX:-UseAutoVectorizationPredicate. Same performance as patch, we just always use multiversioning immediately. In a separate benchmark, I can show that this requires more C2 compile time and produces larger code - so less desirable.
  • no_multiversioning - -XX:-LoopMultiversioning: struggles with mixed cases. As soon as it encounters an aliasing case, the predicate leads to deopt, and then we recompile without predicate, and so do not vectorize any more - you get scalar performance.
  • no_rt_check - -XX:-UseAutoVectorizationSpeculativeAliasingChecks: behavior as before patch - no vectorization of runtime check required.

:warning: Investigation:

  • not_profitable spends 96% of runtime in the main loop, not vectorized, 64x unrolled.
  • patch spends 97% in the main loop, not vectorized, 64x unrolled. So far I have no explanation. Strangely, I did not see this behavior earlier, when I published the benchmarks in the PR description.

Continued investigation, using perf stat. Compare metrics not_profitable vs patch:

  • ns/op: 2804.839 vs 3165.108 - not_profitable wins, but why?
  • page-faults: 39,523 vs 39,604 - not relevant
  • cycles: 18,641,133,534 vs 18,247,472,016 - similar number of cycles
  • instructions: 42,579,432,139 vs 38,553,272,686 - significant deviation in work per time (10%), but why?
  • branches: 2,470,061,665 vs 2,446,983,828 - similar amount of branches (1% diff) - how does that fit with difference in instructions?
  • branch-misses: 85,476,483 vs 83,848,355 -> both about 3.45%
  • tma_backend_bound: 21.3 vs 24.8 - there seems to be a bottleneck in the backend for patch of 10% :warning:
  • tma_bad_speculation: 21.5 vs 22.6 - speculation has minor contribution as well - actually it is 5% worse! :warning:
  • tma_frontend_bound: 14.8 vs 14.8
  • tma_retiring: 42.4 vs 37.7 - clearly not_profitable executes code more efficiently :warning:

not_profitable has scalar 64x unrolled loop. Head:

vmovd  %xmm0,%r8d
vmovd  %r8d,%xmm0
add    %esi,%r8d
vmovd  %xmm2,%r10d
add    %esi,%r10d
movslq %r8d,%r11
movslq %r10d,%r8
movslq %esi,%r10
lea    (%rax,%r10,1),%r9
lea    (%r10,%rbp,1),%rbx

Repeated 64 times, with different constant offsets:

movsbl 0x4f(%rdx,%r11,1),%r10d
mov    %r10b,0x4f(%rcx,%r8,1)

Tail:

add    $0x40,%esi
cmp    %r14d,%esi
jl     0x00007f70c8bebeb0 // jumps to head

patch has scalar 64x unrolled loop. Head:

vmovd  %xmm0,%ebx
add    %r10d,%ebx
mov    0x4(%rsp),%r9d
add    %r10d,%r9d
movslq %ebx,%r8
movslq %r9d,%rbx
movslq %r10d,%r9
lea    (%r9,%rbp,1),%rdi
lea    (%r9,%r13,1),%rax

Repeated 64x, with different constant offsets:

movsbl 0x4f(%rdx,%r8,1),%r9d
mov    %r9b,0x4f(%rcx,%rbx,1)

Tail:

add    $0x40,%r10d
cmp    %r11d,%r10d
jl     0x00007f79a8bec950 // jumps to head

The code really looks almost identical. I'm not sure what is happening here.


I've been trying to get more info via perf stat, but my machine does not seem to support more counters. So it's difficult to see why exactly I have different percentages for tma_backend_bound and tma_bad_speculation. Maybe it is due to a missing vzeroupper or something else.

It seems that the multiversioning mode produces problems that the predicate and non-vectorized modes do not have. Strangely, a few weeks ago (see PR description) I did not have these issues, but now I see a 10% performance difference... that is a bit much.

Maybe we can accept a 10% performance regression in the edge-case where we have memory aliasing. As soon as there are some cases that have no aliasing, we get immense speedups from the vectorized loop. So it is most likely overall quite profitable to take the patch as is.

But we should investigate the performance difference anyway. So if anybody has an idea what to do, I'd be very thankful!

eme64 avatar Aug 19 '25 14:08 eme64

@vnkozlov I ran some more benchmarks:

Thank you for running benchmarks. Which one do you check first for aliasing code: multiversioning or predicates?

From these experiments I think the best sequence would be (when both predicates and multiversioning are enabled):

  • use predicates for aliasing (fast compilation, small code)
  • if it is deoptimized recompile with multiversioning

Is this how it works now?

vnkozlov avatar Aug 19 '25 16:08 vnkozlov

@vnkozlov I now automatically disable the flag if the others are both off.

I've also investigated the performance issue with the aliasing case that uses multiversioning. And I so far could not figure out the 10% performance regression, see detailed analysis attempt https://github.com/openjdk/jdk/pull/24278#issuecomment-3201092650

eme64 avatar Aug 20 '25 12:08 eme64

@vnkozlov I now automatically disable the flag if the others are both off.

Good.

vnkozlov avatar Aug 20 '25 15:08 vnkozlov

I've also investigated the performance issue with the aliasing case that uses multiversioning. And I so far could not figure out the 10% performance regression, see detailed analysis attempt https://github.com/openjdk/jdk/pull/24278#issuecomment-3201092650

Is it possible it always goes into the slow path?

vnkozlov avatar Aug 20 '25 15:08 vnkozlov

I've also investigated the performance issue with the aliasing case that uses multiversioning. And I so far could not figure out the 10% performance regression, see detailed analysis attempt #24278 (comment)

Is it possible it always goes into the slow path?

Yes, the aliasing case would always take the slow path. But that should be as fast as the scalar performance before the patch, and the same performance as not_profitable where we do not vectorize. The strange thing is now that we enter the slow path, but somehow the performance is 10% lower than before. But as I showed, the scalar code is basically the same in the main loop that we execute. Something must be causing the 10% difference...

eme64 avatar Aug 21 '25 06:08 eme64

I created a stand-alone test to be able to run perf stat without the overheads of JMH. The numbers look different, but the conclusion seems to be the same: we have differing backend_bound results: 30% vs 36%. And a drastic difference in tma_retiring as well.

Both tests run quite long, about 30sec. And compilation is done after about 1sec, so we are really measuring the steady-state.

// java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java

public class Test {
    public static int size = 100_000;

    public static void main(String[] args) {
        byte[] a = new byte[size];
        for (int i = 0; i < 1000_000; i++) {
            copy_B(a, a, 0, 0, size); // always alias
        }
    }

    public static void copy_B(byte[] a, byte b[], int aOffset, int bOffset, int size) {
        for (int i = 0; i < size; i++) {
            b[i + bOffset] = a[i + aOffset];
        }
    }
}

Running it with patch, which eventually runs with multiversioning in the slow-loop:

[empeter@emanuel bin]$ perf stat ../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java
CompileCommand: compileonly Test.copy* bool compileonly = true
CompileCommand: PrintCompilation Test.copy* bool PrintCompilation = true
2172   98 %  b  3       Test::copy_B @ 3 (29 bytes)
2172   99    b  3       Test::copy_B (29 bytes)
2173  100 %  b  4       Test::copy_B @ 3 (29 bytes)
2198  101    b  4       Test::copy_B (29 bytes)
2212  102    b  4       Test::copy_B (29 bytes)

 Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java':

         35,151.89 msec task-clock:u                     #    1.001 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             8,692      page-faults:u                    #  247.270 /sec                      
    86,730,942,915      cycles:u                         #    2.467 GHz                       
   225,939,652,810      instructions:u                   #    2.61  insn per cycle            
     2,931,222,952      branches:u                       #   83.387 M/sec                     
        55,264,982      branch-misses:u                  #    1.89% of all branches           
                        TopdownL1                 #     36.0 %  tma_backend_bound      
                                                  #     14.2 %  tma_bad_speculation    
                                                  #      3.5 %  tma_frontend_bound     
                                                  #     46.3 %  tma_retiring           

      35.111092609 seconds time elapsed

      34.819260000 seconds user
       0.257300000 seconds sys

Running with not_profitable, which compiles only with a single scalar loop:

[empeter@emanuel bin]$ perf stat ../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0  Test.java
CompileCommand: compileonly Test.copy* bool compileonly = true
CompileCommand: PrintCompilation Test.copy* bool PrintCompilation = true
2196   98 %  b  3       Test::copy_B @ 3 (29 bytes)
2196   99    b  3       Test::copy_B (29 bytes)
2197  100 %  b  4       Test::copy_B @ 3 (29 bytes)
2210  101    b  4       Test::copy_B (29 bytes)

 Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0 Test.java':

         31,205.82 msec task-clock:u                     #    1.001 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             8,029      page-faults:u                    #  257.292 /sec                      
    76,952,997,639      cycles:u                         #    2.466 GHz                       
   228,849,251,864      instructions:u                   #    2.97  insn per cycle            
     2,894,918,583      branches:u                       #   92.769 M/sec                     
        55,022,648      branch-misses:u                  #    1.90% of all branches           
                        TopdownL1                 #     30.6 %  tma_backend_bound      
                                                  #     13.1 %  tma_bad_speculation    
                                                  #      3.0 %  tma_frontend_bound     
                                                  #     53.4 %  tma_retiring           

      31.161118421 seconds time elapsed

      30.853187000 seconds user
       0.303616000 seconds sys

eme64 avatar Aug 21 '25 11:08 eme64

I also ran an experiment where I artificially disabled vectorization in the fast-loop for multiversioning, just in case that somehow had an influence on the slow-loop... but that does not change the 10% difference.

Also changing size=1000_000 and adjusting the repetitions to 100_000 does not change the outcome (maybe lowers the branch misprediction slightly).

I also tried to play with loop code alignment, but we keep the 10% difference:

 Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:MaxLoopPad=10000 -XX:OptoLoopAlignment=128 -XX:+UnlockExperimentalVMOptions -XX:CodeEntryAlignment=128 -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=printassembly,Test::copy* Test.java':

         33,769.60 msec task-clock:u                     #    1.001 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             8,949      page-faults:u                    #  265.002 /sec                      
    83,276,272,307      cycles:u                         #    2.466 GHz                       
   225,687,718,484      instructions:u                   #    2.71  insn per cycle            
     2,892,292,576      branches:u                       #   85.648 M/sec                     
        53,822,209      branch-misses:u                  #    1.86% of all branches           
                        TopdownL1                 #     33.1 %  tma_backend_bound      
                                                  #     18.1 %  tma_bad_speculation    
                                                  #      2.9 %  tma_frontend_bound     
                                                  #     45.9 %  tma_retiring           

      33.732766948 seconds time elapsed

      33.393703000 seconds user
       0.329370000 seconds sys

vs

 Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:MaxLoopPad=10000 -XX:OptoLoopAlignment=128 -XX:+UnlockExperimentalVMOptions -XX:CodeEntryAlignment=128 -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=printassembly,Test::copy* -XX:AutoVectorizationOverrideProfitability=0 Test.java':

         31,201.05 msec task-clock:u                     #    1.001 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
             8,266      page-faults:u                    #  264.927 /sec                      
    76,917,123,162      cycles:u                         #    2.465 GHz                       
   228,567,013,995      instructions:u                   #    2.97  insn per cycle            
     2,844,199,474      branches:u                       #   91.157 M/sec                     
        52,808,358      branch-misses:u                  #    1.86% of all branches           
                        TopdownL1                 #     32.3 %  tma_backend_bound      
                                                  #     10.8 %  tma_bad_speculation    
                                                  #      2.7 %  tma_frontend_bound     
                                                  #     54.2 %  tma_retiring           

      31.160664433 seconds time elapsed

      30.849468000 seconds user
       0.310109000 seconds sys

FYI, I did see main loop alignment:

  0x00007fab00bb31c6:   data16 data16 nopw 0x0(%rax,%rax,1)
  0x00007fab00bb31d1:   data16 data16 xchg %ax,%ax
  0x00007fab00bb31d5:   data16 data16 nopw 0x0(%rax,%rax,1)
  0x00007fab00bb31e0:   data16 data16 xchg %ax,%ax
  0x00007fab00bb31e4:   data16 data16 nopw 0x0(%rax,%rax,1)
  0x00007fab00bb31ef:   data16 data16 xchg %ax,%ax
  0x00007fab00bb31f3:   nopw   0x0(%rax,%rax,1)
  0x00007fab00bb31fc:   data16 data16 xchg %ax,%ax
  ----------- start main loop --------
  0x00007fab00bb3200:   vmovd  %xmm0,%ecx
  0x00007fab00bb3204:   add    %r10d,%ecx
  0x00007fab00bb3207:   mov    0x4(%rsp),%r9d
  0x00007fab00bb320c:   add    %r10d,%r9d
  0x00007fab00bb320f:   movslq %ecx,%r8
  0x00007fab00bb3212:   movslq %r9d,%rcx
  0x00007fab00bb3215:   movslq %r10d,%r9
  0x00007fab00bb3218:   lea    (%r9,%rbp,1),%rbx
  0x00007fab00bb321c:   lea    (%r9,%r13,1),%rax
  0x00007fab00bb3220:   movsbl 0x10(%rsi,%rax,1),%r9d
  0x00007fab00bb3226:   mov    %r9b,0x10(%rdx,%rbx,1)
  0x00007fab00bb322b:   movsbl 0x11(%rsi,%rax,1),%r9d
  0x00007fab00bb3231:   mov    %r9b,0x11(%rdx,%rbx,1)
  0x00007fab00bb3236:   movsbl 0x12(%rsi,%rax,1),%r9d
  0x00007fab00bb323c:   mov    %r9b,0x12(%rdx,%rbx,1)
  0x00007fab00bb3241:   movsbl 0x13(%rsi,%r8,1),%r9d
  0x00007fab00bb3247:   mov    %r9b,0x13(%rdx,%rcx,1)

eme64 avatar Aug 21 '25 11:08 eme64

I'm going to run the benchmarks on our benchmarking servers now, just to see if this can be reproduced across platforms.

eme64 avatar Aug 21 '25 12:08 eme64

It would be nice to have a code profiling tool which could show which part of the code for these two cases is hot, instead of guessing based on whole-system behavior.

vnkozlov avatar Aug 21 '25 18:08 vnkozlov