8320448: Accelerate IndexOf using AVX2
Rewrite the IndexOf code to avoid the pcmpestri instruction, using only AVX2 instructions. This change accelerates String.indexOf by roughly 1.3x on average on AVX2-capable CPUs. The benchmark numbers:
| Benchmark | Base score | Latest | Ratio |
|-----------|-----------:|-------:|------:|
| StringIndexOf.advancedWithMediumSub | 343.573 | 317.934 | 0.925375393x |
| StringIndexOf.advancedWithShortSub1 | 1039.081 | 1053.96 | 1.014319384x |
| StringIndexOf.advancedWithShortSub2 | 55.828 | 110.541 | 1.980027943x |
| StringIndexOf.constantPattern | 9.361 | 11.906 | 1.271872663x |
| StringIndexOf.searchCharLongSuccess | 4.216 | 4.218 | 1.000474383x |
| StringIndexOf.searchCharMediumSuccess | 3.133 | 3.216 | 1.02649218x |
| StringIndexOf.searchCharShortSuccess | 3.76 | 3.761 | 1.000265957x |
| StringIndexOf.success | 9.186 | 9.713 | 1.057369911x |
| StringIndexOf.successBig | 14.341 | 46.343 | 3.231504079x |
| StringIndexOfChar.latin1_AVX2_String | 6220.918 | 12154.52 | 1.953814533x |
| StringIndexOfChar.latin1_AVX2_char | 5503.556 | 5540.044 | 1.006629895x |
| StringIndexOfChar.latin1_SSE4_String | 6978.854 | 6818.689 | 0.977049957x |
| StringIndexOfChar.latin1_SSE4_char | 5657.499 | 5474.624 | 0.967675646x |
| StringIndexOfChar.latin1_Short_String | 7132.541 | 6863.359 | 0.962260014x |
| StringIndexOfChar.latin1_Short_char | 16013.389 | 16162.437 | 1.009307711x |
| StringIndexOfChar.latin1_mixed_String | 7386.123 | 14771.622 | 1.999915517x |
| StringIndexOfChar.latin1_mixed_char | 9901.671 | 9782.245 | 0.987938803x |
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8320448: Accelerate IndexOf using AVX2 (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16753/head:pull/16753
$ git checkout pull/16753
Update a local copy of the PR:
$ git checkout pull/16753
$ git pull https://git.openjdk.org/jdk.git pull/16753/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16753
View PR using the GUI difftool:
$ git pr show -t 16753
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16753.diff
:wave: Welcome back sgibbons! A progress list of the required criteria for merging this PR into master
will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@asgibbons this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, run the following commands in the local repository for your personal fork:
git checkout indexof
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@asgibbons The following labels will be automatically applied to this pull request:
- core-libs
- hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
Opening up for review. Fixed last whitespace error. I will post final performance numbers soon.
Webrevs
- 51: Full - Incremental (f432320f)
- 50: Full - Incremental (6eae46e5)
- 49: Full - Incremental (57e115d7)
- 48: Full - Incremental (3e150fe3)
- 47: Full - Incremental (ed06edd6)
- 46: Full - Incremental (db0ab75a)
- 45: Full - Incremental (355325d0)
- 44: Full - Incremental (751aace8)
- 43: Full - Incremental (01cb58fb)
- 42: Full - Incremental (15994a39)
- 41: Full - Incremental (e13c7ea4)
- 40: Full - Incremental (b154faee)
- 39: Full - Incremental (be001e2c)
- 38: Full - Incremental (69ca8d13)
- 37: Full - Incremental (485d02fd)
- 36: Full - Incremental (5d10a20b)
- 35: Full - Incremental (1a71eb10)
- 34: Full - Incremental (c034d3f9)
- 33: Full - Incremental (2283f2bf)
- 32: Full - Incremental (cba6ffbe)
- 31: Full - Incremental (23d2c511)
- 30: Full - Incremental (87b1ebe8)
- 29: Full - Incremental (40a1e628)
- 28: Full - Incremental (42af0b50)
- 27: Full - Incremental (027daf73)
- 26: Full - Incremental (ed4451d1)
- 25: Full (f4eefe1a)
- 24: Full - Incremental (b0ef5e6f)
- 23: Full - Incremental (f002fd54)
- 22: Full - Incremental (b6d77fe0)
- 21: Full - Incremental (f4ca4a5e)
- 20: Full - Incremental (38868a35)
- 19: Full - Incremental (9a861979)
- 18: Full - Incremental (fb4da92a)
- 17: Full (f52d281d)
- 16: Full - Incremental (1d141fde)
- 15: Full (8e0ce70a)
- 14: Full - Incremental (1cd1b501)
- 13: Full (e079fc12)
- 12: Full - Incremental (82bcd8b3)
- 11: Full - Incremental (62e4e8b1)
- 10: Full - Incremental (71ebf9e7)
- 09: Full - Incremental (8638a5f4)
- 08: Full - Incremental (e3e91953)
- 07: Full - Incremental (827ca245)
- 06: Full (3e58d0c2)
- 05: Full (600377b0)
- 04: Full - Incremental (63db0961)
- 03: Full - Incremental (48088348)
- 02: Full - Incremental (5e03173e)
- 01: Full - Incremental (e614b86f)
- 00: Full (60d762b9)
Latest numbers vs. top-of-tree JDK.
/label add hotspot-compiler-dev
@jatin-bhateja The hotspot-compiler label was successfully added.
Thank you all for the reviews. I have been asked to simplify this code to facilitate easier review, and to remove the gcc library code I used for memcmp. We also discovered that special-casing up to 31 bytes of needle was not necessary to get the performance gain. I will be commenting the code and fixing the single build failure soon.
Hi @asgibbons,
I am getting the following error with a slowdebug build on a Meteor Lake system.
ERROR: Build failed for target 'images' in configuration 'linux-x86_64-server-slowdebug' (exit code 2)
=== Output from failing command(s) repeated here ===
* For target buildtools_create_symbols_javac__the.COMPILE_CREATE_SYMBOLS_batch:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/jatinbha/sandboxes/jdk-reviews/jdk/src/hotspot/share/asm/assembler.hpp:169), pid=237140, tid=237235
# assert(is_bound() || is_unused()) failed: Label was never bound to a location, but it was used as a jmp target
#
# JRE version: OpenJDK Runtime Environment (23.0) (slowdebug build 23-internal-adhoc.sdp.jdk)
# Java VM: OpenJDK 64-Bit Server VM (slowdebug 23-internal-adhoc.sdp.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x4c9e3e] Label::~Label()+0x5c
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/jatinbha/sandboxes/jdk-reviews/jdk/make/core.237140)
#
# An error report file with more information is saved as:
# /home/jatinbha/sandboxes/jdk-reviews/jdk/make/hs_err_pid237140.log
... (rest of output omitted)
@jatin-bhateja Thanks for the note. I fixed a couple of slowdebug compile issues; it now builds successfully on the Meteor Lake system.
Hi @asgibbons,
I am still getting build errors with the latest patch on a Meteor Lake system. Kindly check:
ERROR: Build failed for target 'images' in configuration 'linux-x86_64-server-fastdebug' (exit code 2)
=== Output from failing command(s) repeated here ===
* For target buildtools_create_symbols_javac__the.COMPILE_CREATE_SYMBOLS_batch:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f7220495dec, pid=530719, tid=530722
#
# JRE version: OpenJDK Runtime Environment (23.0) (fastdebug build 23-internal-adhoc.root.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 23-internal-adhoc.root.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# v ~StubRoutines::stringIndexOf 0x00007f7220495dec
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/jatinbha/sandboxes/jdk-reviews/jdk/make/core.530719)
#
# An error report file with more information is saved as:
# /home/jatinbha/sandboxes/jdk-reviews/jdk/make/hs_err_pid530719.log
[5.638s][warning][os] Loading hsdis library failed
... (rest of output omitted)
* All command lines available in /home/jatinbha/sandboxes/jdk-reviews/jdk/build/linux-x86_64-server-fastdebug/make-support/failure-logs.
=== End of repeated output ===
No indication of failed target found.
HELP: Try searching the build log for '] Error'.
HELP: Run 'make doctor' to diagnose build problems.
make[1]: *** [/home/jatinbha/sandboxes/jdk-reviews/jdk/make/Init.gmk:323: main] Error 2
make: *** [/home/jatinbha/sandboxes/jdk-reviews/jdk/make/Init.gmk:189: images] Error 2
@asgibbons This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8320448: Accelerate IndexOf using AVX2
Reviewed-by: epeter, kvn, sviswanathan
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 263 new commits pushed to the master branch:
- d744059b5b3e944bee53536de6f404666e45e8e5: 8333774: Avoid eagerly loading various EmptySpliterator classes
- d130d2f4f46d37a2b924343de19d012c129b0a55: 8333477: Delete extra empty spaces in Makefiles
- 486dee2cf420981b4c8111c24c5fbd27aceb238b: 8333653: Remove MallocHeader::get_stack
- 40b2fbd8207404961d3d23375b288cceafc3f902: 8331733: [PPC64] saving and restoring CR is not needed at most places
- 6968770b1e918c74fc009e3562a827bb4acbe2d7: 8331935: Add support for primitive array C1 clone intrinsic in PPC
- a2030fff9833aba40e8c7c177151a30a0812a250: 8332516: Serial: Always sample promoted bytes to avoid getting stuck in Full GCs
- bf7f1c41cc2a2b98775301bc377a4c6e1340a736: 8333211: NMT Reports: replace manual indentation handling with auto indent
- 8ffc35d117846a7a2aa08afed662273d2f887770: 8333724: Problem list security/infra/java/security/cert/CertPathValidator/certification/CAInterop.java#teliasonerarootcav1
- f7862bd6b9994814c6dfd43d471122408601f288: 8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store
- b4beda21b487886b022e04766e140e6d1df1038a: 8332537: C2: High memory usage reported for compiler/loopopts/superword/TestAlignVectorFuzzer.java
- ... and 253 more: https://git.openjdk.org/jdk/compare/37c477856d543163b60dd2b85a5e6ac35a752211...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@eme64, @vnkozlov, @sviswa7) but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate
in a new comment. (Afterwards, your sponsor types /sponsor
in a new comment to perform the integration).
Re-opening this PR after some insightful comments, which resulted in a re-design. Now ready for review.
Now showing ~1.6x performance gain on the original StringIndexOf benchmark. I added a benchmark (StringIndexOfHuge) that measures performance on large-ish strings (e.g., a 34-byte substring within a 2052-byte string). This benchmark performs at roughly 14x the original on average.
Sorry for the large change, but it couldn't be done piecemeal.
Hi, in Netty we have our own AsciiString::indexOf based on SWAR techniques. It manually unrolls the head processing (the first < 8 bytes) to make sure the branch predictor sees distinct branches and the JIT doesn't get it wrong. We have measured (I can provide a link to the benchmark and results if you are interested) that it delivers much better performance on tiny strings and degrades more smoothly versus perfectly aligned string lengths. This tends to be most visible when the input strings have shuffled delimiter positions, which makes branch prediction misses more relevant.
Hi @franz1981. I'd be interested in seeing the code. I looked here, but that just looks like a naïve implementation, so I must be missing something.
This code uses vector compares to search for the first byte of the substring (needle) in 32-byte chunks of the input string (haystack). However, once a character is found, it also checks for a match corresponding to the last byte of the needle within the haystack before doing the full needle comparison. This is also in 32-byte chunks. That is, we load a vector register with 32 (or 16 for wide chars) copies of the first byte of the needle, and another with copies of the last byte of the needle. The first comparison is done at the start of the haystack, giving us indication of the presence and index of the first byte. We then compare the last byte of the needle at the haystack indexed at needle length - 1 (i.e., the last byte). This tells us if the last byte of the needle appears in the correct relative position within the haystack. ANDing these results tells us whether or not we have a candidate needle within the haystack, as well as the position of the needle. Only then do we do a full char-by-char comparison of the needle to the haystack (this is also done with vector instructions when possible).
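The first-byte/last-byte candidate filter described above can be modeled in scalar Java. Everything here (class and method names, the explicit 32-position chunk loop) is illustrative, not the stub's actual code; the real implementation does the per-lane compares and the AND on 32-byte vector registers:

```java
// Illustrative scalar model of the vector candidate filter: per 32-byte
// chunk, build a mask of positions where both the first and the last needle
// byte match at the right relative offsets, then run the full needle compare
// only at surviving candidate positions.
public class IndexOfSketch {
    public static int indexOf(byte[] hay, byte[] needle) {
        if (needle.length == 0) return 0;
        byte first = needle[0];
        byte last = needle[needle.length - 1];
        for (int base = 0; base + needle.length <= hay.length; base += 32) {
            // number of valid start positions within this 32-byte chunk
            int window = Math.min(32, hay.length - needle.length - base + 1);
            long mask = 0;
            for (int i = 0; i < window; i++) {
                boolean f = hay[base + i] == first;                    // compare vs. first-byte splat
                boolean l = hay[base + i + needle.length - 1] == last; // compare vs. last-byte splat
                if (f && l) mask |= 1L << i;                           // AND of the two compare masks
            }
            while (mask != 0) {                                        // walk candidate bits
                int i = Long.numberOfTrailingZeros(mask);
                mask &= mask - 1;
                if (matchesAt(hay, base + i, needle)) return base + i; // full needle compare
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] hay, int pos, byte[] needle) {
        for (int j = 0; j < needle.length; j++)
            if (hay[pos + j] != needle[j]) return false;
        return true;
    }
}
```

The point of ANDing the two masks is that a position must match at both ends of the needle before the (comparatively expensive) full compare runs, which prunes most false candidates cheaply.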
A lot of this code is there to handle the cases of small-ish needles within small-ish haystacks, where 32-byte reads are not possible (due to reading past the end of the strings, possibly generating a page fault). I handle less than 32-byte haystacks by copying the haystack to the stack, where I can be assured that 32-byte reads will be possible. So there are special cases for haystack < 32 bytes with needle sizes <= 10 (arbitrary) and one for haystacks > 32 bytes and needle size <= 10. I also added a section for haystacks <= 32 bytes and needle sizes < 5, which seem to be the most common cases. This path copies the haystack to the stack (a single vector read & write) and up to 5 vector comparisons, one for each byte of the needle, with no branching or looping.
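A minimal Java model of the copy-to-safe-buffer idea (illustrative only; in the native stub the copy is a single vector read and write to the stack, and the scalar loop below merely stands in for the vector search):

```java
import java.util.Arrays;

// Illustrative model: pad a short haystack out to the full vector width so
// code written in fixed 32-byte chunks can never read past the end of the
// original data (the page-fault concern the comment above describes).
public class ShortHaystack {
    static final int WIDTH = 32; // model of the AVX2 vector width in bytes

    public static int indexOf(byte[] hay, byte[] needle) {
        // zero-padded copy; for haystacks shorter than WIDTH, a full
        // WIDTH-byte read from offset 0 is now always in bounds
        byte[] buf = Arrays.copyOf(hay, Math.max(hay.length, WIDTH));
        // matches are still only accepted at positions inside the real haystack
        for (int i = 0; i + needle.length <= hay.length; i++) {
            int j = 0;
            while (j < needle.length && buf[i + j] == needle[j]) j++;
            if (j == needle.length) return i;
        }
        return -1;
    }
}
```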
I'd be very interested in seeing the Netty SWAR implementation. Thanks for the comment.
Sure thing: https://github.com/netty/netty/pull/13534#issuecomment-1685247165
That's the comparison with String::indexOf, while the impl is: https://github.com/netty/netty/blob/3a3f9d13b129555802de5652667ca0af662f554e/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java#L590
Note that this one relates to an indexOf(byte) kind of search, not indexOf(byte[]); for the latter we use a classic two-way search.
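For reference, the SWAR single-byte search can be sketched in plain Java as follows. This is an assumed simplification in the same spirit, not Netty's exact code: pack 8 bytes into a long, XOR with a broadcast of the target so matching lanes become 0x00, then find the first zero lane with the classic zero-byte bit trick.

```java
// SWAR (SIMD within a register) search for a single byte, 8 bytes at a time.
public class SwarIndexOf {
    private static final long LO = 0x0101010101010101L;
    private static final long HI = 0x8080808080808080L;

    public static int indexOf(byte[] a, byte target) {
        long pattern = (target & 0xFFL) * LO; // broadcast target into all 8 lanes
        int i = 0;
        for (; i + 8 <= a.length; i += 8) {
            long word = littleEndianLong(a, i) ^ pattern; // matching lane becomes 0x00
            // (word - LO) & ~word & HI sets 0x80 in every zero lane; borrow-induced
            // false positives can only appear above the first true zero, so the
            // lowest set bit always marks a real match.
            long zeros = (word - LO) & ~word & HI;
            if (zeros != 0)
                return i + (Long.numberOfTrailingZeros(zeros) >>> 3);
        }
        for (; i < a.length; i++) // scalar tail for the last < 8 bytes
            if (a[i] == target) return i;
        return -1;
    }

    private static long littleEndianLong(byte[] a, int off) {
        long w = 0;
        for (int k = 7; k >= 0; k--) w = (w << 8) | (a[off + k] & 0xFFL);
        return w;
    }
}
```

With little-endian packing, the lowest-addressed byte lands in the low-order lane, which is why the lane index falls out of `numberOfTrailingZeros(zeros) >>> 3`.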
Comment on behalf of @sviswa7: It is unclear whether `size` in `byte_compare_helper` is intended to be in bytes or in elements. Please check its consistency.
/contributor add @vpaprotsk /contributor add @sviswa7
@asgibbons `@vpaprotsk` was not found in the census.
Syntax: `/contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]`. For example:
- `/contributor add @openjdk-bot`
- `/contributor add duke`
- `/contributor add J. Duke <[email protected]>`
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
@asgibbons Contributor `Sandhya Viswanathan <[email protected]>` successfully added.
/contributor remove @sviswa7 By her request
/contributor add @vpaprots
@asgibbons Contributor `Sandhya Viswanathan <[email protected]>` successfully removed.
@asgibbons `@vpaprots` was not found in the census.
Syntax: `/contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]`. For example:
- `/contributor add @openjdk-bot`
- `/contributor add duke`
- `/contributor add J. Duke <[email protected]>`
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
I submitted our testing for the latest v34 version of the changes.
My testing for v34 passed without new failures.
Thank you @vnkozlov . Waiting for review from @sviswa7 and @jatin-bhateja, then I'll integrate.
Thank you all for the comments. If there are no objections, I'll integrate these fixes tomorrow morning.
I've run tier1-3 tests with the appropriate options on my machine with no errors, so my confidence is high.
Let me test the latest version before integration.