8324751: C2 SuperWord: Aliasing Analysis runtime check
This is a big patch, but about 3.5k lines are tests, and a large part of the VM changes is comments/proofs.
I am adding a dynamic (runtime) aliasing check to the auto-vectorizer (SuperWord). We use the infrastructure from https://github.com/openjdk/jdk/pull/22016:
- Use the auto-vectorization predicate when available: we speculate that there is no aliasing, else we trap and re-compile without the predicate.
- If the predicate is not available, we use multiversioning, i.e. we have a `fast_loop` where there is no aliasing, and hence vectorization, and a `slow_loop` if the check fails, with no vectorization.
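At the Java level, the multiversioned loop corresponds roughly to the following sketch. The aliasing check and helper names here are hypothetical illustrations of what the compiler-generated check decides, not actual C2 IR or output of this patch:

```java
public class MultiversionSketch {
    // Conceptual Java-level view of a multiversioned copy loop.
    static void copy(byte[] a, byte[] b, int aOff, int bOff, int n) {
        // Runtime aliasing check: do the accessed ranges
        // [aOff, aOff+n) and [bOff, bOff+n) of the same array overlap?
        boolean noAlias = (a != b) || (aOff + n <= bOff) || (bOff + n <= aOff);
        if (noAlias) {
            // fast_loop: iterations are independent, safe to vectorize.
            for (int i = 0; i < n; i++) { b[bOff + i] = a[aOff + i]; }
        } else {
            // slow_loop: keep strict scalar order, no vectorization.
            for (int i = 0; i < n; i++) { b[bOff + i] = a[aOff + i]; }
        }
    }
}
```

With the predicate approach there is no slow copy of the loop: the check guards the compiled method, and a failing check deoptimizes and triggers recompilation without the predicate.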
Where to start reviewing
- `src/hotspot/share/opto/mempointer.hpp`:
  - Read the class comment for `MemPointerRawSummand`.
  - Familiarize yourself with the MemPointer Linearity Corollary. We need it for the proofs of the aliasing runtime checks.
- `src/hotspot/share/opto/vectorization.cpp`:
  - Read the explanations and proofs above `VPointer::can_make_speculative_aliasing_check_with`. It explains how the aliasing runtime check works.
- `src/hotspot/share/opto/vtransform.hpp`:
  - Understand the difference between weak and strong edges.
If you need to see some examples, then look at the tests:
- `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasing.java`: simple array cases. IR rules that check for vectors, and in some cases whether we used multiversioning.
- `test/micro/org/openjdk/bench/vm/compiler/VectorAliasing.java`: the micro-benchmarks I show below. Simple array cases.
- `test/hotspot/jtreg/compiler/loopopts/superword/TestMemorySegmentAliasing.java`: a bit more advanced, but similar cases.
- `test/hotspot/jtreg/compiler/loopopts/superword/TestAliasingFuzzer.java`: very large and rather complex. Generates random loops, some with and some without aliasing at runtime. IR verification, but currently mostly only for array cases; the MemorySegment cases have some issues (see comments).
Details
Most fundamentally:
- I had to refactor/extend `MemPointer` so that we have access to `MemPointerRawSummand`s.
- These raw summands allow us to reconstruct the `VPointer` at any `iv` value with `VPointer::make_pointer_expression(Node* iv_value)`.
  - With the raw summands, a pointer may look like this: `p = base + ConvI2L(x + 2) + ConvI2L(y + 2)`
  - With "regular" summands, this gets simplified to `p = base + 4L + ConvI2L(x) + ConvI2L(y)`
  - For aliasing analysis (adjacency and overlap), the "regular" summands are sufficient. But for reconstructing the pointer expression, this simplification could lead to overflow issues.
- We need to evaluate the pointer expression at `init` to create the check in `VPointer::make_speculative_aliasing_check_with`.
- I wrote up a MemPointer Linearity Corollary that I need for the guarantees in the runtime checks.
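The overflow concern can be seen in plain Java: folding the constant out of the `int` addition before widening to `long` is not value-preserving near `Integer.MAX_VALUE`. A small standalone demo (not part of the patch):

```java
public class ConvI2LOverflowDemo {
    public static void main(String[] args) {
        int x = Integer.MAX_VALUE - 1;
        // Raw-summand form ConvI2L(x + 2): the int addition wraps first.
        long raw = (long) (x + 2);       // wraps to -2147483648
        // Simplified form ConvI2L(x) + 2L: widen first, then add in long.
        long simplified = (long) x + 2L; // 2147483648, no wrap
        System.out.println(raw + " vs " + simplified);
    }
}
```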
I also had to enhance the `VLoopDependencyGraph`:
- We define `weak` and `strong` memory edges:
  - `strong` edges cannot be removed.
  - `weak` edges can be removed, and the operations can be reordered, but if reordered we need a runtime check.
- `MemPointer::always_overlaps_with`: allows us to check if a memory edge is always strong, because it always aliases (= overlaps).
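The interval logic behind this classification can be sketched in Java (a hypothetical helper mirroring the overlap reasoning, not the C2 API):

```java
public class OverlapSketch {
    // Two accesses covering [p1, p1+s1) and [p2, p2+s2) overlap iff each
    // one starts before the other one ends.
    static boolean overlaps(long p1, long s1, long p2, long s2) {
        return p1 < p2 + s2 && p2 < p1 + s1;
    }

    public static void main(String[] args) {
        // Same address, same size: always overlaps -> the edge is strong.
        System.out.println(overlaps(100, 4, 100, 4)); // true
        // Disjoint ranges: no overlap -> reordering would be safe.
        System.out.println(overlaps(100, 4, 104, 4)); // false
    }
}
```

A weak edge is one where the outcome of this test depends on runtime values (e.g. unknown invariant offsets), so it can only be dropped under a runtime check.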
Further:
- I added the flags `UseAutoVectorizationPredicate` and `UseAutoVectorizationSpeculativeAliasingChecks`.
Benchmark
Labels / Columns:
- `no_check` = `-XX:-UseAutoVectorizationSpeculativeAliasingChecks`: like before this patch.
- `normal` = `-XX:+UseSuperWord`
- `no_slow_opt` = `-XX:-LoopMultiversioningOptimizeSlowLoop`: to prove that we need to optimize the slow loop, for the case where the dynamic check fails.
- `no_sw` = `-XX:-UseSuperWord`: no vectorization, also has different unrolling.
- `not_profitable` = `-XX:AutoVectorizationOverrideProfitability=0`: no vectorization, but keeps unrolling the same. Can lead to severe performance regressions, especially for byte cases. We have seen similar issues before, e.g. https://github.com/openjdk/jdk/pull/25387 for `byte`, `char` and `short` cases in reduction loops.
Discussion:
- `?_sameIndex_alias` and `?_sameIndex_noalias`: since we have `sameIndex`, we can already prove that we can vectorize without checks. We already vectorized these before this patch.
- `?_differentIndex_noalias`, `?_half`, `?_partial_overlap`: only vectorize with the dynamic aliasing check.
- `?_differentIndex_alias`: cannot use the vectorized loop. We now use the `slow_loop`, and if it is not optimized (unrolled), we get a heavy slowdown (0.35).
Regular performance testing: no significant change, except some possible improvements in Crypto-SecureRandomBench_nextBytes. A quick investigation showed that it has at least one loop where the load and the store have different invariants, which requires aliasing runtime checks to prove that the load and store do not alias.
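A loop of the kind described, where the load and store use different invariant offsets into the same array, might look like this (hypothetical example, not the actual benchmark code):

```java
public class DifferentInvariants {
    // src and dst are loop invariants; whether a[dst + i] ever touches
    // a[src + i] depends on their runtime values, so only a runtime
    // aliasing check can justify vectorizing this loop.
    static void shift(byte[] a, int src, int dst, int n) {
        for (int i = 0; i < n; i++) {
            a[dst + i] = a[src + i];
        }
    }
}
```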
Follow-up Work
A ResourceMark could not be added in `VTransform::apply_speculative_aliasing_runtime_checks`: it would require that `_idom` and `_dom_depth` in `PhaseIdealLoop::set_idom` are not ResourceArea allocated. Related issue:
- JDK-8337015 Revisit resource arena allocations in C2
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8324751: C2 SuperWord: Aliasing Analysis runtime check (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/24278/head:pull/24278
$ git checkout pull/24278
Update a local copy of the PR:
$ git checkout pull/24278
$ git pull https://git.openjdk.org/jdk.git pull/24278/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 24278
View PR using the GUI difftool:
$ git pr show -t 24278
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/24278.diff
Using Webrev
:wave: Welcome back epeter! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@eme64 This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8324751: C2 SuperWord: Aliasing Analysis runtime check
Reviewed-by: kvn, mhaessig
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 45 new commits pushed to the master branch:
- bd4c0f4a7da9122527dd25df74797c42deaced3c: 8358618: UnsupportedOperationException constructors javadoc is not clear
- f1c0b4ed722bf4cc5f262e804cec26d59ceb6e8b: 8361495: (fc) Async close of streams connected to uninterruptible FileChannel doesn't throw AsynchronousCloseException in all cases
- b43c2c663567e59f8b5c84b1b45536078190605b: 8366225: Linux Alpine (fast)debug build fails after JDK-8365909
- ... and 42 more: https://git.openjdk.org/jdk/compare/45726a1f8b8f76586037867a32b82f8ab9b96937...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@eme64 The following label will be automatically applied to this pull request:
hotspot-compiler
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
@eme64 this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:
git checkout JDK-8324751-Aliasing-Analysis-RTC
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@eme64 This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Webrevs
- 22: Full (a36e3f7a)
- 21: Full - Incremental (2cfe1097)
- 20: Full - Incremental (198bff79)
- 19: Full - Incremental (d718bd3f)
- 18: Full - Incremental (a00b385c)
- 17: Full - Incremental (8480d814)
- 16: Full - Incremental (f84ec341)
- 15: Full - Incremental (41e45bf3)
- 14: Full - Incremental (4fb1bc11)
- 13: Full (67c6dd74)
- 12: Full - Incremental (a5fdf97b)
- 11: Full - Incremental (1fc7caa0)
- 10: Full - Incremental (0180dd27)
- 09: Full - Incremental (e6e790eb)
- 08: Full - Incremental (21ea9b2b)
- 07: Full - Incremental (4a240226)
- 06: Full - Incremental (e05b6297)
- 05: Full - Incremental (238342ae)
- 04: Full - Incremental (8f1f9329)
- 03: Full - Incremental (2e353a51)
- 02: Full - Incremental (6bd997ec)
- 01: Full (d7e856d8)
- 00: Full (c260df26)
@eme64 This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.
/open
@eme64 This pull request is now open
@eme64 This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply issue a /touch or /keepalive command to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
/touch
Still hoping for reviewers :)
@eme64 The pull request is being re-evaluated and the inactivity timeout has been reset.
@mhaessig Thanks for reviewing! I fixed the merge conflict, and addressed all your comments :)
⚠️ @eme64 This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request in integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).
@mhaessig Thanks for the responses. I integrated the one suggestion now, I think it is ready for another round :blush:
@mhaessig Thanks for the detailed review! I think I responded to all your suggestions/comments.
@chhagedorn Thanks for the drive-by comments about the Predicate documentation. Are you now satisfied? Maybe @rwestrel should have a look at it too, since I completed the missing documentation from the Short-Running-Long-Loop-Predicates as well.
@eme64 did you measure how much C2 compilation time changed with these changes (all optimizations enabled)?
> @eme64 did you measure how much C2 compilation time changed with these changes (all optimizations enabled)?
I did not. I don't think it would take much extra time in almost all cases. The extra analysis is not that costly compared to the unrolling that we do in all cases already. What might cost more: if we deopt because of the runtime check and recompile with multiversioning, that could essentially double C2 compile time for those cases.
Do you think it is worth it to benchmark now, or should we just rely on @robcasloz 's occasional benchmarking and address the issues if they come up?
If you want me to do C2 time benchmarking: should I just show a few specific micro-benchmarks, or do you want to have statistics collected on larger benchmark suites?
> Do you think it is worth it to benchmark now, or should we just rely on @robcasloz 's occasional benchmarking and address the issues if they come up?
I am fine with using Roberto's benchmarking later. Just keep an eye on it.
@vnkozlov I ran some more benchmarks:
Columns:
- `not_profitable` = `-XX:AutoVectorizationOverrideProfitability=0`: serves as the baseline scalar performance. Unrolling is the same as if we vectorized.
- `no_sw` = `-XX:-UseSuperWord`: can mess with the unrolling factor, and thus gets worse performance.
- `patch` = no flags: overall best performance, except for `bench_copy_array_B_differentIndex_alias` and `bench_copy_array_I_differentIndex_alias` - need to investigate :warning:
- `no_predicate` = `-XX:-UseAutoVectorizationPredicate`: same performance as `patch`, we just always use multiversioning immediately. In a separate benchmark, I can show that this requires more C2 compile time and produces larger code, so it is less desirable.
- `no_multiversioning` = `-XX:-LoopMultiversioning`: struggles with mixed cases. As soon as it encounters an aliasing case, the predicate leads to a deopt, and then we recompile without the predicate and do not vectorize any more - you get scalar performance.
- `no_rt_check` = `-XX:-UseAutoVectorizationSpeculativeAliasingChecks`: behavior as before the patch - no vectorization where a runtime check would be required.
:warning: Investigation:
- `not_profitable` spends 96% of runtime in the main loop: not vectorized, 64x unrolled.
- `patch` spends 97% in the main loop: not vectorized, 64x unrolled.

So far I have no explanation. It is strange that I did not see this behavior earlier, when I published the benchmarks in the PR description.
Continued investigation, using `perf stat`. Comparing metrics for `not_profitable` vs `patch`:
- ns/op: 2804.839 vs 3165.108 - `not_profitable` wins, but why?
- page-faults: 39,523 vs 39,604 - not relevant
- cycles: 18,641,133,534 vs 18,247,472,016 - similar number of cycles
- instructions: 42,579,432,139 vs 38,553,272,686 - significant deviation in work per time (10%), but why?
- branches: 2,470,061,665 vs 2,446,983,828 - similar number of branches (1% diff) - how does that fit with the difference in instructions?
- branch-misses: 85,476,483 vs 83,848,355 - both about 3.45%
- tma_backend_bound: 21.3 vs 24.8 - there seems to be a backend bottleneck for `patch` of 10% :warning:
- tma_bad_speculation: 21.5 vs 22.6 - speculation has a minor contribution as well - actually it is 5% worse! :warning:
- tma_frontend_bound: 14.8 vs 14.8
- tma_retiring: 42.4 vs 37.7 - clearly `not_profitable` executes code more efficiently :warning:
`not_profitable` has a scalar 64x unrolled loop.
Head:
vmovd %xmm0,%r8d
vmovd %r8d,%xmm0
add %esi,%r8d
vmovd %xmm2,%r10d
add %esi,%r10d
movslq %r8d,%r11
movslq %r10d,%r8
movslq %esi,%r10
lea (%rax,%r10,1),%r9
lea (%r10,%rbp,1),%rbx
Repeated 64 times, with different constant offsets:
movsbl 0x4f(%rdx,%r11,1),%r10d
mov %r10b,0x4f(%rcx,%r8,1)
Tail:
add $0x40,%esi
cmp %r14d,%esi
jl 0x00007f70c8bebeb0 // jumps to head
`patch` has a scalar 64x unrolled loop.
Head:
vmovd %xmm0,%ebx
add %r10d,%ebx
mov 0x4(%rsp),%r9d
add %r10d,%r9d
movslq %ebx,%r8
movslq %r9d,%rbx
movslq %r10d,%r9
lea (%r9,%rbp,1),%rdi
lea (%r9,%r13,1),%rax
Repeated 64x, with different constant offsets:
movsbl 0x4f(%rdx,%r8,1),%r9d
mov %r9b,0x4f(%rcx,%rbx,1)
Tail:
add $0x40,%r10d
cmp %r11d,%r10d
jl 0x00007f79a8bec950 // jumps to head
The code really looks almost identical. I'm not sure what is happening here.
I've been trying to get more info via `perf stat`, but my machine does not seem to support more counters. So it's difficult to see why exactly I have different percentages on tma_backend_bound and tma_bad_speculation. Maybe it is due to a missing vzeroupper or something else.
It seems that the multiversioning mode produces problems that the predicate and non-vectorized modes do not have. What is strange: a few weeks ago (see PR description) I did not have these issues, but now I see a 10% performance difference... that is a bit much.
Maybe we can accept a 10% performance regression in the edge-case where we have memory aliasing. As soon as there are some cases that have no aliasing, we get immense speedups from the vectorized loop. So it is most likely overall quite profitable to take the patch as is.
But we should investigate the performance difference anyway. So if anybody has an idea what to do, I'd be very thankful!
> @vnkozlov I ran some more benchmarks:
Thank you for running benchmarks. Which one do you check first for aliasing code: multiversioning or predicates?
From these experiments I think the best sequence would be (when both predicates and multiversioning are enabled):
- use predicates for aliasing (fast compilation, small code)
- if it is deoptimized recompile with multiversioning
Is this how it works now?
@vnkozlov I now automatically disable the flag if the others are both off.
I've also investigated the performance issue with the aliasing case that uses multiversioning. And I so far could not figure out the 10% performance regression, see detailed analysis attempt https://github.com/openjdk/jdk/pull/24278#issuecomment-3201092650
> @vnkozlov I now automatically disable the flag if the others are both off.
Good.
> I've also investigated the performance issue with the aliasing case that uses multiversioning. And I so far could not figure out the 10% performance regression, see detailed analysis attempt https://github.com/openjdk/jdk/pull/24278#issuecomment-3201092650
Is it possible it always goes into the slow path?
> I've also investigated the performance issue with the aliasing case that uses multiversioning. And I so far could not figure out the 10% performance regression, see detailed analysis attempt #24278 (comment)

> Is it possible it always goes into the slow path?
Yes, the aliasing case would always take the slow path. But that should be as fast as the scalar performance before the patch, and the same performance as `not_profitable`, where we do not vectorize. The strange thing is that we now enter the slow path, but somehow the performance is 10% lower than before. And as I showed, the scalar code of the main loop that we execute is basically the same. Something must be causing the 10% difference...
I created a stand-alone test to be able to run perf stat without the overheads of JMH. The numbers look different, but the conclusion seems to be the same: we have differing backend_bound results: 30% vs 36%. And a drastic difference in tma_retiring as well.
Both tests run quite long, about 30 sec, and compilation is done after about 1 sec, so we are really measuring the steady state.
// java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java
public class Test {
    public static int size = 100_000;

    public static void main(String[] args) {
        byte[] a = new byte[size];
        for (int i = 0; i < 1000_000; i++) {
            copy_B(a, a, 0, 0, size); // always alias
        }
    }

    public static void copy_B(byte[] a, byte[] b, int aOffset, int bOffset, int size) {
        for (int i = 0; i < size; i++) {
            b[i + bOffset] = a[i + aOffset];
        }
    }
}
Running it with patch, which eventually runs with multiversioning in the slow-loop:
[empeter@emanuel bin]$ perf stat ../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java
CompileCommand: compileonly Test.copy* bool compileonly = true
CompileCommand: PrintCompilation Test.copy* bool PrintCompilation = true
2172 98 % b 3 Test::copy_B @ 3 (29 bytes)
2172 99 b 3 Test::copy_B (29 bytes)
2173 100 % b 4 Test::copy_B @ 3 (29 bytes)
2198 101 b 4 Test::copy_B (29 bytes)
2212 102 b 4 Test::copy_B (29 bytes)
Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch Test.java':
35,151.89 msec task-clock:u # 1.001 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
8,692 page-faults:u # 247.270 /sec
86,730,942,915 cycles:u # 2.467 GHz
225,939,652,810 instructions:u # 2.61 insn per cycle
2,931,222,952 branches:u # 83.387 M/sec
55,264,982 branch-misses:u # 1.89% of all branches
TopdownL1 # 36.0 % tma_backend_bound
# 14.2 % tma_bad_speculation
# 3.5 % tma_frontend_bound
# 46.3 % tma_retiring
35.111092609 seconds time elapsed
34.819260000 seconds user
0.257300000 seconds sys
Running with not_profitable, which compiles only with a single scalar loop:
[empeter@emanuel bin]$ perf stat ../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0 Test.java
CompileCommand: compileonly Test.copy* bool compileonly = true
CompileCommand: PrintCompilation Test.copy* bool PrintCompilation = true
2196 98 % b 3 Test::copy_B @ 3 (29 bytes)
2196 99 b 3 Test::copy_B (29 bytes)
2197 100 % b 4 Test::copy_B @ 3 (29 bytes)
2210 101 b 4 Test::copy_B (29 bytes)
Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:+UnlockDiagnosticVMOptions -XX:AutoVectorizationOverrideProfitability=0 Test.java':
31,205.82 msec task-clock:u # 1.001 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
8,029 page-faults:u # 257.292 /sec
76,952,997,639 cycles:u # 2.466 GHz
228,849,251,864 instructions:u # 2.97 insn per cycle
2,894,918,583 branches:u # 92.769 M/sec
55,022,648 branch-misses:u # 1.90% of all branches
TopdownL1 # 30.6 % tma_backend_bound
# 13.1 % tma_bad_speculation
# 3.0 % tma_frontend_bound
# 53.4 % tma_retiring
31.161118421 seconds time elapsed
30.853187000 seconds user
0.303616000 seconds sys
I also ran an experiment where I artificially disabled vectorization in the fast-loop for multiversioning, just in case that somehow had an influence on the slow-loop... but that does not change the 10% difference.
Also, changing size=1000_000 and adjusting the repetitions to 100_000 does not change the outcome (maybe it lowers the branch misprediction slightly).
I also tried to play with loop code alignment, but we keep the 10% difference:
Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:MaxLoopPad=10000 -XX:OptoLoopAlignment=128 -XX:+UnlockExperimentalVMOptions -XX:CodeEntryAlignment=128 -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=printassembly,Test::copy* Test.java':
33,769.60 msec task-clock:u # 1.001 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
8,949 page-faults:u # 265.002 /sec
83,276,272,307 cycles:u # 2.466 GHz
225,687,718,484 instructions:u # 2.71 insn per cycle
2,892,292,576 branches:u # 85.648 M/sec
53,822,209 branch-misses:u # 1.86% of all branches
TopdownL1 # 33.1 % tma_backend_bound
# 18.1 % tma_bad_speculation
# 2.9 % tma_frontend_bound
# 45.9 % tma_retiring
33.732766948 seconds time elapsed
33.393703000 seconds user
0.329370000 seconds sys
vs
Performance counter stats for '../../../linux-x64/jdk/bin/java -XX:CompileCommand=compileonly,Test::copy* -XX:CompileCommand=printcompilation,Test::copy* -Xbatch -XX:MaxLoopPad=10000 -XX:OptoLoopAlignment=128 -XX:+UnlockExperimentalVMOptions -XX:CodeEntryAlignment=128 -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=printassembly,Test::copy* -XX:AutoVectorizationOverrideProfitability=0 Test.java':
31,201.05 msec task-clock:u # 1.001 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
8,266 page-faults:u # 264.927 /sec
76,917,123,162 cycles:u # 2.465 GHz
228,567,013,995 instructions:u # 2.97 insn per cycle
2,844,199,474 branches:u # 91.157 M/sec
52,808,358 branch-misses:u # 1.86% of all branches
TopdownL1 # 32.3 % tma_backend_bound
# 10.8 % tma_bad_speculation
# 2.7 % tma_frontend_bound
# 54.2 % tma_retiring
31.160664433 seconds time elapsed
30.849468000 seconds user
0.310109000 seconds sys
FYI, I did see main loop alignment:
0x00007fab00bb31c6: data16 data16 nopw 0x0(%rax,%rax,1)
0x00007fab00bb31d1: data16 data16 xchg %ax,%ax
0x00007fab00bb31d5: data16 data16 nopw 0x0(%rax,%rax,1)
0x00007fab00bb31e0: data16 data16 xchg %ax,%ax
0x00007fab00bb31e4: data16 data16 nopw 0x0(%rax,%rax,1)
0x00007fab00bb31ef: data16 data16 xchg %ax,%ax
0x00007fab00bb31f3: nopw 0x0(%rax,%rax,1)
0x00007fab00bb31fc: data16 data16 xchg %ax,%ax
----------- start main loop --------
0x00007fab00bb3200: vmovd %xmm0,%ecx
0x00007fab00bb3204: add %r10d,%ecx
0x00007fab00bb3207: mov 0x4(%rsp),%r9d
0x00007fab00bb320c: add %r10d,%r9d
0x00007fab00bb320f: movslq %ecx,%r8
0x00007fab00bb3212: movslq %r9d,%rcx
0x00007fab00bb3215: movslq %r10d,%r9
0x00007fab00bb3218: lea (%r9,%rbp,1),%rbx
0x00007fab00bb321c: lea (%r9,%r13,1),%rax
0x00007fab00bb3220: movsbl 0x10(%rsi,%rax,1),%r9d
0x00007fab00bb3226: mov %r9b,0x10(%rdx,%rbx,1)
0x00007fab00bb322b: movsbl 0x11(%rsi,%rax,1),%r9d
0x00007fab00bb3231: mov %r9b,0x11(%rdx,%rbx,1)
0x00007fab00bb3236: movsbl 0x12(%rsi,%rax,1),%r9d
0x00007fab00bb323c: mov %r9b,0x12(%rdx,%rbx,1)
0x00007fab00bb3241: movsbl 0x13(%rsi,%r8,1),%r9d
0x00007fab00bb3247: mov %r9b,0x13(%rdx,%rcx,1)
I'm going to run the benchmarks on our benchmarking servers now, just to see if this can be reproduced across platforms.
It would be nice to have a code profiling tool which could show which part of the code is hot for these two cases, instead of guessing based on whole-system behavior.