jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8369564: Provide a MemorySegment API to read strings with known lengths

Open cushon opened this issue 1 month ago • 26 comments

This PR proposes adding a new overload to MemorySegment::getString that takes a known byte length of the content.

This was previously proposed in https://github.com/openjdk/jdk/pull/20725, but the outcome of JDK-8333843 was to update MemorySegment#getString to suggest

    byte[] bytes = new byte[length];
    MemorySegment.copy(segment, JAVA_BYTE, offset, bytes, 0, length);
    return new String(bytes, charset);

However this is less efficient than what the implementation of getString does after JDK-8362893, it now uses JavaLangAccess::uncheckedNewStringNoRepl to avoid the copy.

See also discussion in this panama-dev@ thread, and mcimadamore's document Pulling the (foreign) string

Benchmark results:

Benchmark                                 (size)  Mode  Cnt    Score   Error  Units
ToJavaStringTest.jni_readString                5  avgt   30   55.339 ± 0.401  ns/op
ToJavaStringTest.jni_readString               20  avgt   30   59.887 ± 0.295  ns/op
ToJavaStringTest.jni_readString              100  avgt   30   84.288 ± 0.419  ns/op
ToJavaStringTest.jni_readString              200  avgt   30  119.275 ± 0.496  ns/op
ToJavaStringTest.jni_readString              451  avgt   30  193.106 ± 1.528  ns/op
ToJavaStringTest.panama_copyLength             5  avgt   30    7.348 ± 0.048  ns/op
ToJavaStringTest.panama_copyLength            20  avgt   30    7.440 ± 0.125  ns/op
ToJavaStringTest.panama_copyLength           100  avgt   30   11.766 ± 0.058  ns/op
ToJavaStringTest.panama_copyLength           200  avgt   30   16.096 ± 0.089  ns/op
ToJavaStringTest.panama_copyLength           451  avgt   30   25.844 ± 0.054  ns/op
ToJavaStringTest.panama_readString             5  avgt   30    5.857 ± 0.046  ns/op
ToJavaStringTest.panama_readString            20  avgt   30    7.750 ± 0.046  ns/op
ToJavaStringTest.panama_readString           100  avgt   30   14.109 ± 0.187  ns/op
ToJavaStringTest.panama_readString           200  avgt   30   18.035 ± 0.130  ns/op
ToJavaStringTest.panama_readString           451  avgt   30   35.896 ± 0.227  ns/op
ToJavaStringTest.panama_readStringLength       5  avgt   30    4.565 ± 0.038  ns/op
ToJavaStringTest.panama_readStringLength      20  avgt   30    4.654 ± 0.040  ns/op
ToJavaStringTest.panama_readStringLength     100  avgt   30    8.502 ± 0.207  ns/op
ToJavaStringTest.panama_readStringLength     200  avgt   30   10.950 ± 0.124  ns/op
ToJavaStringTest.panama_readStringLength     451  avgt   30   16.244 ± 0.135  ns/op

Progress

  • [ ] Change requires CSR request JDK-8372338 to be approved
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue
  • [ ] Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issues

  • JDK-8369564: Provide a MemorySegment API to read strings with known lengths (Enhancement - P4)
  • JDK-8372338: Provide a MemorySegment API to read strings with known lengths (CSR)

Reviewers

Contributors

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28043/head:pull/28043
$ git checkout pull/28043

Update a local copy of the PR:
$ git checkout pull/28043
$ git pull https://git.openjdk.org/jdk.git pull/28043/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28043

View PR using the GUI difftool:
$ git pr show -t 28043

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28043.diff

Using Webrev

Link to Webrev Comment

cushon avatar Oct 29 '25 10:10 cushon

/contributor add @minborg

cushon avatar Oct 29 '25 10:10 cushon

:wave: Welcome back cushon! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Oct 29 '25 10:10 bridgekeeper[bot]

❗ This change is not yet ready to be integrated. See the Progress checklist in the description for automated requirements.

openjdk[bot] avatar Oct 29 '25 10:10 openjdk[bot]

@cushon Contributor Per Minborg <[email protected]> successfully added.

openjdk[bot] avatar Oct 29 '25 10:10 openjdk[bot]

@cushon The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Oct 29 '25 10:10 openjdk[bot]

@cushon security has been added to this pull request based on files touched in new commit(s).

openjdk[bot] avatar Nov 19 '25 08:11 openjdk[bot]

/label remove security

wangweij avatar Nov 19 '25 15:11 wangweij

@wangweij The security label was successfully removed.

openjdk[bot] avatar Nov 19 '25 15:11 openjdk[bot]

Thanks for the review!

I think we should put some more care when going to char-based indices to byte array indices, esp. if we will optimize the UTF16 case in the future

Thanks for catching that. I have made some initial updates and added an assertion. To confirm do you think it's OK to leave optimizing the UTF16 case as future work, as long as the current assumptions are clearly documented and guarded by assertions?

Also, I'm planning to spend more time on test coverage for this and going over the javadoc again.

cushon avatar Nov 20 '25 08:11 cushon

/csr needed

mcimadamore avatar Nov 20 '25 11:11 mcimadamore

@mcimadamore has indicated that a compatibility and specification (CSR) request is needed for this pull request.

@cushon please create a CSR request for issue JDK-8369564 with the correct fix version. This pull request cannot be integrated until the CSR request is approved.

openjdk[bot] avatar Nov 20 '25 11:11 openjdk[bot]

To confirm do you think it's OK to leave optimizing the UTF16 case as future work, as long as the current assumptions are clearly documented and guarded by assertions?

I think this is ok, yes -- maybe add a link (comment) between the assertion being thrown and String::bytesCompatible.

mcimadamore avatar Nov 20 '25 11:11 mcimadamore

/reviewers 2

mcimadamore avatar Nov 20 '25 11:11 mcimadamore

@mcimadamore The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 1 Reviewer, 1 Author).

openjdk[bot] avatar Nov 20 '25 11:11 openjdk[bot]

I'd like to suggest some additional changes:

I updated the warnings about truncated reads of strings containing \0 to mention MemorySegment#getString(long, Charset, long) (see discussion here: https://github.com/openjdk/jdk/pull/28043#discussion_r2549972828)

I realized StringSupport#copyBytes was returning the wrong number of written bytes (string.length() vs. numChars), but also the return value wasn't being used anywhere.

I consolidated the copies of copyBytes in StringSupport. This means that for the case that doesn't need srcIndex/numChars there's a call to String#substring, but substring has a fast path to return the entire string, and I think that probably isn't worth micro-optimizing away.

I have tentatively updated the new MemorySegment#copy to return the number of bytes that were written. As we have discussed, the encoded length of a string isn't trivial to compute. Since MemorySegment#copy has to find out the length anyways, I think this is valuable information to return to the caller. The uses of the underlying copyBytes method in StringSupport are evidence of this, it needs to know the number of bytes written to write the null terminator at the correct position. Returning a length from the new copy method isn't consistent with existing overloads of copy, though. For many of those the number of bytes written is obvious, so although it's inconsistent I think it's defensible. What do you think?

cushon avatar Nov 24 '25 09:11 cushon

I have tentatively updated the new MemorySegment#copy to return the number of bytes that were written.

I think I understand the use case. E.g. say a client wants to write a bunch of variable length string to a segment, where the string is prefixed by length (a la protobuf). So what they can do is:

  • write length at offset S
  • write string at offset S + 4, this writes N bytes
  • write next length at offset S + 4 + N
  • write next string at offset S + 4 + N + 4, this writes M bytes
  • ...

Effectively, this unifies copy with setString -- in the sense that now setString is just a string-based copy + a terminator write (at the correct offset) -- the size of the terminator is charset dependent.

I think overall it makes sense -- I believe @JornVernee proposed something similar in the past. It's true it's inconsistent with other copy method and, in hindsight, it would have perhaps been useful to always return a length (given that in other cases length is expressed in "elements" -- either array elements, or elements expressed in a given layout) -- although in the other cases the copy length can usually be computed using a shift operation.

mcimadamore avatar Nov 24 '25 15:11 mcimadamore

I think overall it makes sense -- I believe @JornVernee proposed something similar in the past. It's true it's inconsistent with other copy method and, in hindsight, it would have perhaps been useful to always return a length (given that in other cases length is expressed in "elements" -- either array elements, or elements expressed in a given layout) -- although in the other cases the copy length can usually be computed using a shift operation.

Also, I believe the need for returning a copy length is probably driven by the fact that we have dropped the MemorySegment::ofString(String, Charset) view that we originally anticipated. If we had that method, returning a length wouldn't have been as important -- although perhaps, to compensate fully, we would also have needed a view that took a char index and a char length -- e.g. to view a substring as a memory segment.

mcimadamore avatar Nov 24 '25 15:11 mcimadamore

I have tentatively updated the new MemorySegment#copy to return the number of bytes that were written.

Also, if in the future we also enhance encoders to return the length of a string w/o encoding it, it might still be useful to have copy return the length (to avoid scanning the string chars twice).

mcimadamore avatar Nov 24 '25 15:11 mcimadamore

I think returning the amount of copied bytes makes sense. The amount of bytes that is written is something that is currently hidden inside the implementation of copy, and the user has to guess how many bytes are written.

JornVernee avatar Nov 24 '25 18:11 JornVernee

Also, if in the future we also enhance encoders to return the length of a string w/o encoding it, it might still be useful to have copy return the length (to avoid scanning the string chars twice).

I agree, I think there are use-cases for both: if you expect to have enough room in the destination, it's better to avoid scanning the string twice, even if there's a more efficient way to compute the encoded length in the future.

cushon avatar Nov 25 '25 09:11 cushon

Thanks again for the reviews!

I think the next step is to got through the CSR process (and wait for the JDK 27 branch to open). I made some updates to the draft CSR, let me know if you have any feedback or would be willing to review the CSR: https://bugs.openjdk.org/browse/JDK-8372338

cushon avatar Nov 28 '25 09:11 cushon

Thanks again for the reviews!

I think the next step is to got through the CSR process (and wait for the JDK 27 branch to open). I made some updates to the draft CSR, let me know if you have any feedback or would be willing to review the CSR: https://bugs.openjdk.org/browse/JDK-8372338

I've streamlined the text a bit, and added myself as a reviewer. That should be enough to move it through the pipeline.

mcimadamore avatar Nov 28 '25 10:11 mcimadamore

Re CSR:

diff -U 3 a/JDK‑8372338.md b/JDK‑8372338.md
--- a/JDK‑8372338.md
+++ b/JDK‑8372338.md
@@ -15,7 +15,7 @@
 Solution
 --------

-This change adds threw new methods to support efficient handling of non-null terminated strings:
+This change adds three new methods to support efficient handling of non-null terminated strings:

 * `MemorySegment#getString(long offset, Charset charset, long length)`
 * `MemorySegment#copy(String src, Charset dstEncoding, int srcIndex, MemorySegment dst, long dstOffset, int numChars)`
@@ -28,11 +28,11 @@
 2. number of code units
 3. the number of characters in the resulting string

-(3) was was rejected because for variable length encodings it requires a decoding step to convert to bytes for a bulk copy operation. This leaves (1) and (2) as candidates -- since the conversion between the two is a trivial scaling factor, either would have been a viable choice. Code units might be more natural for native strings encoded as an array of code units. Using a byte length was [decided on](https://mail.openjdk.org/pipermail/panama-dev/2025-November/021215.html) to allow supporting arbitrary charsets, since not all charsets may have a concept of a code unit.
+(3) was rejected because for variable length encodings it requires a decoding step to convert to bytes for a bulk copy operation. This leaves (1) and (2) as candidates -- since the conversion between the two is a trivial scaling factor, either would have been a viable choice. Code units might be more natural for native strings encoded as an array of code units. Using a byte length was [decided on](https://mail.openjdk.org/pipermail/panama-dev/2025-November/021215.html) to allow supporting arbitrary charsets, since not all charsets may have a concept of a code unit.

 For `copy` and `allocateFrom`, the `srcIndex` and `numChars` are expressed in terms of character offsets into the string. This is the only practical choice here, since the client already has a Java string, and computing an offset in bytes or code units would require additional computation.

-The new `copy` method is the dual of the new `getString`, and allows writing strings to a target memory segment without a terminator. There was a potential analogy to the existing `MemorySegment#setString` methods here, but they write strings with null terminators. This operation is more in common with the other `copy` overloads, where here a String is the source of data Strings (as opposed to e.g. an array).
+The new `copy` method is the dual of the new `getString`, and allows writing strings to a target memory segment without a terminator. There was a potential analogy to the existing `MemorySegment#setString` methods here, but they write strings with null terminators. This operation is more in common with the other `copy` overloads, where here a String is the source of data (as opposed to e.g. an array).

 Specification
 -------------

ExE-Boss avatar Nov 28 '25 21:11 ExE-Boss

I have tentatively renamed the parameter of getString from length to byteLength, in response to discussion in the CSR

cushon avatar Dec 15 '25 11:12 cushon

I have tentatively renamed the parameter of getString from length to byteLength, in response to discussion in the CSR

sensible move

mcimadamore avatar Dec 15 '25 11:12 mcimadamore