8320448: Accelerate IndexOf using AVX2
Rewrite the IndexOf code to avoid the pcmpestri instruction, using only AVX2 instructions. This change accelerates String.indexOf by roughly 1.3x on average on AVX2-capable CPUs. The benchmark numbers:
| Benchmark | Base score | Latest | Ratio |
|-----------|-----------:|-------:|------:|
| StringIndexOf.advancedWithMediumSub | 343.573 | 317.934 | 0.925375393x |
| StringIndexOf.advancedWithShortSub1 | 1039.081 | 1053.96 | 1.014319384x |
| StringIndexOf.advancedWithShortSub2 | 55.828 | 110.541 | 1.980027943x |
| StringIndexOf.constantPattern | 9.361 | 11.906 | 1.271872663x |
| StringIndexOf.searchCharLongSuccess | 4.216 | 4.218 | 1.000474383x |
| StringIndexOf.searchCharMediumSuccess | 3.133 | 3.216 | 1.02649218x |
| StringIndexOf.searchCharShortSuccess | 3.76 | 3.761 | 1.000265957x |
| StringIndexOf.success | 9.186 | 9.713 | 1.057369911x |
| StringIndexOf.successBig | 14.341 | 46.343 | 3.231504079x |
| StringIndexOfChar.latin1_AVX2_String | 6220.918 | 12154.52 | 1.953814533x |
| StringIndexOfChar.latin1_AVX2_char | 5503.556 | 5540.044 | 1.006629895x |
| StringIndexOfChar.latin1_SSE4_String | 6978.854 | 6818.689 | 0.977049957x |
| StringIndexOfChar.latin1_SSE4_char | 5657.499 | 5474.624 | 0.967675646x |
| StringIndexOfChar.latin1_Short_String | 7132.541 | 6863.359 | 0.962260014x |
| StringIndexOfChar.latin1_Short_char | 16013.389 | 16162.437 | 1.009307711x |
| StringIndexOfChar.latin1_mixed_String | 7386.123 | 14771.622 | 1.999915517x |
| StringIndexOfChar.latin1_mixed_char | 9901.671 | 9782.245 | 0.987938803x |
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8320448: Accelerate IndexOf using AVX2 (Enhancement - P4)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/16753/head:pull/16753
$ git checkout pull/16753
Update a local copy of the PR:
$ git checkout pull/16753
$ git pull https://git.openjdk.org/jdk.git pull/16753/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 16753
View PR using the GUI difftool:
$ git pr show -t 16753
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/16753.diff
:wave: Welcome back sgibbons! A progress list of the required criteria for merging this PR into master
will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@asgibbons this pull request cannot be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request, run the following commands in the local repository for your personal fork:
git checkout indexof
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
@asgibbons The following labels will be automatically applied to this pull request:
- core-libs
- hotspot
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.
Opening up for review. Fixed last whitespace error. I will post final performance numbers soon.
Webrevs
- 51: Full - Incremental (f432320f)
- 50: Full - Incremental (6eae46e5)
- 49: Full - Incremental (57e115d7)
- 48: Full - Incremental (3e150fe3)
- 47: Full - Incremental (ed06edd6)
- 46: Full - Incremental (db0ab75a)
- 45: Full - Incremental (355325d0)
- 44: Full - Incremental (751aace8)
- 43: Full - Incremental (01cb58fb)
- 42: Full - Incremental (15994a39)
- 41: Full - Incremental (e13c7ea4)
- 40: Full - Incremental (b154faee)
- 39: Full - Incremental (be001e2c)
- 38: Full - Incremental (69ca8d13)
- 37: Full - Incremental (485d02fd)
- 36: Full - Incremental (5d10a20b)
- 35: Full - Incremental (1a71eb10)
- 34: Full - Incremental (c034d3f9)
- 33: Full - Incremental (2283f2bf)
- 32: Full - Incremental (cba6ffbe)
- 31: Full - Incremental (23d2c511)
- 30: Full - Incremental (87b1ebe8)
- 29: Full - Incremental (40a1e628)
- 28: Full - Incremental (42af0b50)
- 27: Full - Incremental (027daf73)
- 26: Full - Incremental (ed4451d1)
- 25: Full (f4eefe1a)
- 24: Full - Incremental (b0ef5e6f)
- 23: Full - Incremental (f002fd54)
- 22: Full - Incremental (b6d77fe0)
- 21: Full - Incremental (f4ca4a5e)
- 20: Full - Incremental (38868a35)
- 19: Full - Incremental (9a861979)
- 18: Full - Incremental (fb4da92a)
- 17: Full (f52d281d)
- 16: Full - Incremental (1d141fde)
- 15: Full (8e0ce70a)
- 14: Full - Incremental (1cd1b501)
- 13: Full (e079fc12)
- 12: Full - Incremental (82bcd8b3)
- 11: Full - Incremental (62e4e8b1)
- 10: Full - Incremental (71ebf9e7)
- 09: Full - Incremental (8638a5f4)
- 08: Full - Incremental (e3e91953)
- 07: Full - Incremental (827ca245)
- 06: Full (3e58d0c2)
- 05: Full (600377b0)
- 04: Full - Incremental (63db0961)
- 03: Full - Incremental (48088348)
- 02: Full - Incremental (5e03173e)
- 01: Full - Incremental (e614b86f)
- 00: Full (60d762b9)
Latest numbers vs. top-of-tree JDK.
/label add hotspot-compiler-dev
@jatin-bhateja The hotspot-compiler label was successfully added.
Thank you all for the reviews. I have been asked to simplify this code to facilitate easier review, and to remove the gcc library code I used for memcmp. We also discovered that special-casing up to 31 bytes of needle was not necessary to get the performance gain. I will be commenting the code and fixing the single build failure soon.
Hi @asgibbons,
I am getting the following error with a slowdebug build on a Meteor Lake system.
ERROR: Build failed for target 'images' in configuration 'linux-x86_64-server-slowdebug' (exit code 2)
=== Output from failing command(s) repeated here ===
* For target buildtools_create_symbols_javac__the.COMPILE_CREATE_SYMBOLS_batch:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (/home/jatinbha/sandboxes/jdk-reviews/jdk/src/hotspot/share/asm/assembler.hpp:169), pid=237140, tid=237235
# assert(is_bound() || is_unused()) failed: Label was never bound to a location, but it was used as a jmp target
#
# JRE version: OpenJDK Runtime Environment (23.0) (slowdebug build 23-internal-adhoc.sdp.jdk)
# Java VM: OpenJDK 64-Bit Server VM (slowdebug 23-internal-adhoc.sdp.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x4c9e3e] Label::~Label()+0x5c
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/jatinbha/sandboxes/jdk-reviews/jdk/make/core.237140)
#
# An error report file with more information is saved as:
# /home/jatinbha/sandboxes/jdk-reviews/jdk/make/hs_err_pid237140.log
... (rest of output omitted)
@jatin-bhateja Thanks for the note. I fixed a couple of slowdebug compile issues; it now builds successfully on the Meteor Lake system.
Hi @asgibbons,
I am still getting build errors with the latest patch on a Meteor Lake system. Kindly check:
ERROR: Build failed for target 'images' in configuration 'linux-x86_64-server-fastdebug' (exit code 2)
=== Output from failing command(s) repeated here ===
* For target buildtools_create_symbols_javac__the.COMPILE_CREATE_SYMBOLS_batch:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f7220495dec, pid=530719, tid=530722
#
# JRE version: OpenJDK Runtime Environment (23.0) (fastdebug build 23-internal-adhoc.root.jdk)
# Java VM: OpenJDK 64-Bit Server VM (fastdebug 23-internal-adhoc.root.jdk, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# v ~StubRoutines::stringIndexOf 0x00007f7220495dec
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /home/jatinbha/sandboxes/jdk-reviews/jdk/make/core.530719)
#
# An error report file with more information is saved as:
# /home/jatinbha/sandboxes/jdk-reviews/jdk/make/hs_err_pid530719.log
[5.638s][warning][os] Loading hsdis library failed
... (rest of output omitted)
* All command lines available in /home/jatinbha/sandboxes/jdk-reviews/jdk/build/linux-x86_64-server-fastdebug/make-support/failure-logs.
=== End of repeated output ===
No indication of failed target found.
HELP: Try searching the build log for '] Error'.
HELP: Run 'make doctor' to diagnose build problems.
make[1]: *** [/home/jatinbha/sandboxes/jdk-reviews/jdk/make/Init.gmk:323: main] Error 2
make: *** [/home/jatinbha/sandboxes/jdk-reviews/jdk/make/Init.gmk:189: images] Error 2
@asgibbons This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8320448: Accelerate IndexOf using AVX2
Reviewed-by: epeter, kvn, sviswanathan
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 263 new commits pushed to the master branch:
- d744059b5b3e944bee53536de6f404666e45e8e5: 8333774: Avoid eagerly loading various EmptySpliterator classes
- d130d2f4f46d37a2b924343de19d012c129b0a55: 8333477: Delete extra empty spaces in Makefiles
- 486dee2cf420981b4c8111c24c5fbd27aceb238b: 8333653: Remove MallocHeader::get_stack
- 40b2fbd8207404961d3d23375b288cceafc3f902: 8331733: [PPC64] saving and restoring CR is not needed at most places
- 6968770b1e918c74fc009e3562a827bb4acbe2d7: 8331935: Add support for primitive array C1 clone intrinsic in PPC
- a2030fff9833aba40e8c7c177151a30a0812a250: 8332516: Serial: Always sample promoted bytes to avoid getting stuck in Full GCs
- bf7f1c41cc2a2b98775301bc377a4c6e1340a736: 8333211: NMT Reports: replace manual indentation handling with auto indent
- 8ffc35d117846a7a2aa08afed662273d2f887770: 8333724: Problem list security/infra/java/security/cert/CertPathValidator/certification/CAInterop.java#teliasonerarootcav1
- f7862bd6b9994814c6dfd43d471122408601f288: 8331311: C2: Big Endian Port of 8318446: optimize stores into primitive arrays by combining values into larger store
- b4beda21b487886b022e04766e140e6d1df1038a: 8332537: C2: High memory usage reported for compiler/loopopts/superword/TestAlignVectorFuzzer.java
- ... and 253 more: https://git.openjdk.org/jdk/compare/37c477856d543163b60dd2b85a5e6ac35a752211...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@eme64, @vnkozlov, @sviswa7) but any other Committer may sponsor as well.
➡️ To flag this PR as ready for integration with the above commit message, type /integrate
in a new comment. (Afterwards, your sponsor types /sponsor
in a new comment to perform the integration).
Re-opening this PR after some insightful comments, which resulted in a re-design. Now ready for review.
Now showing ~1.6x performance gain on the original StringIndexOf benchmark. I added a benchmark (StringIndexOfHuge) that measures performance on large-ish strings (e.g., a 34-byte substring within a 2052-byte string). This benchmark performs at roughly 14x the original on average.
Sorry for the large change, but it couldn't be done piecemeal.
Hi, in Netty we have our own AsciiString::indexOf based on SWAR techniques. It manually unrolls the head processing (the first < 8 bytes) to make sure the branch predictor sees distinct branches and the JIT doesn't get it wrong. We have measured (I can provide a link to the benchmark and results if you are interested) that it delivers much better performance on tiny strings and degrades more smoothly versus perfectly aligned string lengths. This tends to be most visible when the input strings have shuffled delimiter positions, which makes branch prediction misses more relevant.
Hi @franz1981. I'd be interested in seeing the code. I looked here, but that just looks like a naïve implementation, so I must be missing something.
This code uses vector compares to search for the first byte of the substring (needle) in 32-byte chunks of the input string (haystack). However, once a character is found, it also checks for a match corresponding to the last byte of the needle within the haystack before doing the full needle comparison. This is also in 32-byte chunks. That is, we load a vector register with 32 (or 16 for wide chars) copies of the first byte of the needle, and another with copies of the last byte of the needle. The first comparison is done at the start of the haystack, giving us indication of the presence and index of the first byte. We then compare the last byte of the needle at the haystack indexed at needle length - 1 (i.e., the last byte). This tells us if the last byte of the needle appears in the correct relative position within the haystack. ANDing these results tells us whether or not we have a candidate needle within the haystack, as well as the position of the needle. Only then do we do a full char-by-char comparison of the needle to the haystack (this is also done with vector instructions when possible).
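The first-byte/last-byte candidate filter described above can be modeled in scalar Java. Everything here (class and method names, the explicit 32-position chunk loop) is illustrative, not the stub's actual code; the real implementation does the per-lane compares and the AND on 32-byte vector registers:

```java
// Illustrative scalar model of the vector candidate filter: per 32-byte
// chunk, build a mask of positions where both the first and the last needle
// byte match at the right relative offsets, then run the full needle compare
// only at surviving candidate positions.
public class IndexOfSketch {
    public static int indexOf(byte[] hay, byte[] needle) {
        if (needle.length == 0) return 0;
        byte first = needle[0];
        byte last = needle[needle.length - 1];
        for (int base = 0; base + needle.length <= hay.length; base += 32) {
            // number of valid start positions within this 32-byte chunk
            int window = Math.min(32, hay.length - needle.length - base + 1);
            long mask = 0;
            for (int i = 0; i < window; i++) {
                boolean f = hay[base + i] == first;                    // compare vs. first-byte splat
                boolean l = hay[base + i + needle.length - 1] == last; // compare vs. last-byte splat
                if (f && l) mask |= 1L << i;                           // AND of the two compare masks
            }
            while (mask != 0) {                                        // walk candidate bits
                int i = Long.numberOfTrailingZeros(mask);
                mask &= mask - 1;
                if (matchesAt(hay, base + i, needle)) return base + i; // full needle compare
            }
        }
        return -1;
    }

    private static boolean matchesAt(byte[] hay, int pos, byte[] needle) {
        for (int j = 0; j < needle.length; j++)
            if (hay[pos + j] != needle[j]) return false;
        return true;
    }
}
```

The point of ANDing the two masks is that a position must match at both ends of the needle before the (comparatively expensive) full compare runs, which prunes most false candidates cheaply.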
A lot of this code is there to handle the cases of small-ish needles within small-ish haystacks, where 32-byte reads are not possible (due to reading past the end of the strings, possibly generating a page fault). I handle less than 32-byte haystacks by copying the haystack to the stack, where I can be assured that 32-byte reads will be possible. So there are special cases for haystack < 32 bytes with needle sizes <= 10 (arbitrary) and one for haystacks > 32 bytes and needle size <= 10. I also added a section for haystacks <= 32 bytes and needle sizes < 5, which seem to be the most common cases. This path copies the haystack to the stack (a single vector read & write) and up to 5 vector comparisons, one for each byte of the needle, with no branching or looping.
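A minimal Java model of the copy-to-safe-buffer idea (illustrative only; in the native stub the copy is a single vector read and write to the stack, and the scalar loop below merely stands in for the vector search):

```java
import java.util.Arrays;

// Illustrative model: pad a short haystack out to the full vector width so
// code written in fixed 32-byte chunks can never read past the end of the
// original data (the page-fault concern the comment above describes).
public class ShortHaystack {
    static final int WIDTH = 32; // model of the AVX2 vector width in bytes

    public static int indexOf(byte[] hay, byte[] needle) {
        // zero-padded copy; for haystacks shorter than WIDTH, a full
        // WIDTH-byte read from offset 0 is now always in bounds
        byte[] buf = Arrays.copyOf(hay, Math.max(hay.length, WIDTH));
        // matches are still only accepted at positions inside the real haystack
        for (int i = 0; i + needle.length <= hay.length; i++) {
            int j = 0;
            while (j < needle.length && buf[i + j] == needle[j]) j++;
            if (j == needle.length) return i;
        }
        return -1;
    }
}
```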
I'd be very interested in seeing the Netty SWAR implementation. Thanks for the comment.
Sure thing: https://github.com/netty/netty/pull/13534#issuecomment-1685247165
That's the comparison with String::indexOf, while the impl is: https://github.com/netty/netty/blob/3a3f9d13b129555802de5652667ca0af662f554e/buffer/src/main/java/io/netty/buffer/ByteBufUtil.java#L590
Note that this one relates to an indexOf(byte) kind of search, not indexOf(byte[]); for the latter we use a classic two-way search.
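For reference, the SWAR single-byte search can be sketched in plain Java as follows. This is an assumed simplification in the same spirit, not Netty's exact code: pack 8 bytes into a long, XOR with a broadcast of the target so matching lanes become 0x00, then find the first zero lane with the classic zero-byte bit trick.

```java
// SWAR (SIMD within a register) search for a single byte, 8 bytes at a time.
public class SwarIndexOf {
    private static final long LO = 0x0101010101010101L;
    private static final long HI = 0x8080808080808080L;

    public static int indexOf(byte[] a, byte target) {
        long pattern = (target & 0xFFL) * LO; // broadcast target into all 8 lanes
        int i = 0;
        for (; i + 8 <= a.length; i += 8) {
            long word = littleEndianLong(a, i) ^ pattern; // matching lane becomes 0x00
            // (word - LO) & ~word & HI sets 0x80 in every zero lane; borrow-induced
            // false positives can only appear above the first true zero, so the
            // lowest set bit always marks a real match.
            long zeros = (word - LO) & ~word & HI;
            if (zeros != 0)
                return i + (Long.numberOfTrailingZeros(zeros) >>> 3);
        }
        for (; i < a.length; i++) // scalar tail for the last < 8 bytes
            if (a[i] == target) return i;
        return -1;
    }

    private static long littleEndianLong(byte[] a, int off) {
        long w = 0;
        for (int k = 7; k >= 0; k--) w = (w << 8) | (a[off + k] & 0xFFL);
        return w;
    }
}
```

With little-endian packing, the lowest-addressed byte lands in the low-order lane, which is why the lane index falls out of `numberOfTrailingZeros(zeros) >>> 3`.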
Comment on behalf of @sviswa7: It is unclear whether `size` in `byte_compare_helper` is intended to be in bytes or in elements. Please check its consistency.
/contributor add @vpaprotsk /contributor add @sviswa7
@asgibbons `@vpaprotsk` was not found in the census.
Syntax: `/contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]`. For example:
- `/contributor add @openjdk-bot`
- `/contributor add duke`
- `/contributor add J. Duke <[email protected]>`
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
@asgibbons Contributor `Sandhya Viswanathan <[email protected]>` successfully added.
/contributor remove @sviswa7 By her request
/contributor add @vpaprots
@asgibbons Contributor `Sandhya Viswanathan <[email protected]>` successfully removed.
@asgibbons `@vpaprots` was not found in the census.
Syntax: `/contributor (add|remove) [@user | openjdk-user | Full Name <email@address>]`. For example:
- `/contributor add @openjdk-bot`
- `/contributor add duke`
- `/contributor add J. Duke <[email protected]>`
User names can only be used for users in the census associated with this repository. For other contributors you need to supply the full name and email address.
I submitted our testing for the latest v34 version of the changes.
My testing for v34 passed without new failures.
Thank you @vnkozlov . Waiting for review from @sviswa7 and @jatin-bhateja, then I'll integrate.
Thank you all for the comments. If there are no objections, I'll integrate these fixes tomorrow morning.
I've run tier1-3 tests with the appropriate options on my machine with no errors, so my confidence is high.
Let me test the latest version before integration.