jdk icon indicating copy to clipboard operation
jdk copied to clipboard

8358533: Improve performance of java.io.Reader.readAllLines

Open bplb opened this issue 5 months ago • 9 comments
trafficstars

Replaces the implementation readAllCharsAsString().lines().toList() with reading into a temporary char array which is then processed to detect line terminators and copy non-terminating characters into strings which are added to the list.


Progress

  • [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • [x] Change must not contain extraneous whitespace
  • [x] Commit message must refer to an issue

Issue

  • JDK-8358533: Improve performance of java.io.Reader.readAllLines (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25863/head:pull/25863
$ git checkout pull/25863

Update a local copy of the PR:
$ git checkout pull/25863
$ git pull https://git.openjdk.org/jdk.git pull/25863/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 25863

View PR using the GUI difftool:
$ git pr show -t 25863

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25863.diff

Using Webrev

Link to Webrev Comment

bplb avatar Jun 18 '25 00:06 bplb

The throughput of the implementation as measured by the included benchmark appears to hover around 13% greater than that of the existing method. The updated method should also have a smaller memory footprint for streams of non-trivial length as it does not first create a single intermediate String containing all lines in the stream. Instead it uses a char array of size 8192 and a StringBuilder whose maximum length will be the length of the longest line in the input.

bplb avatar Jun 18 '25 00:06 bplb

:wave: Welcome back bpb! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

bridgekeeper[bot] avatar Jun 18 '25 00:06 bridgekeeper[bot]

@bplb This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8358533: Improve performance of java.io.Reader.readAllLines

Reviewed-by: rriggs, sherman

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 185 new commits pushed to the master branch:

  • 25ed36f3ef1fe1d6914689c762910f104775f48c: 8359493: Refactor how aggregated mandatory warnings are handled in the compiler
  • a2315ddd2a343ed594dd1b0b3d0dc5b3a71f509b: 8357739: [jittester] disable the hashCode method
  • 66836d40b80f9c5482c1322d1d07f078ad9dcc02: 8361292: Rename ModuleEntry::module() to module_oop()
  • ... and 182 more: https://git.openjdk.org/jdk/compare/96070212adfd15acd99edf6e180db6228ee7b4ff...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk[bot] avatar Jun 18 '25 00:06 openjdk[bot]

@bplb The following label will be automatically applied to this pull request:

  • core-libs

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

openjdk[bot] avatar Jun 18 '25 00:06 openjdk[bot]

If we want better performance, we should go a step further and overload the readAllLines method in the Reader implementation class.

For example, in the most commonly used InputStreamReader, overload readAllLines through StreamDecoder and make special optimizations for UTF8/ISO_8859_1 encoding.

In StringReader, special overload methods can also be used for optimization.

wenshao avatar Jun 18 '25 09:06 wenshao

If we want better performance, we should go a step further and overload the readAllLines method in the Reader implementation class.

Perhaps, but not in this request. A separate issue should be filed and addressed subsequently.

bplb avatar Jun 18 '25 14:06 bplb

The commit d5abfa4 does not address most comments provided to date. The algorithm was wrong and I preferred to correct it first.

bplb avatar Jun 24 '25 14:06 bplb

The throughput of the implementation [...].

The performance comments made here still apply to the most recent commit.

bplb avatar Jun 24 '25 19:06 bplb

It seems the only additional info needed to handle resizing and line fragment preservation is the start position of the current line. We might be able to consolidate two nested parsing loops into a single loop as well, which should improve/speed up the iteration performance. The modified code below is a PoC for reference. It seems to work as expected, though not fully tested :-)

public List<String> readAllLines() throws IOException {
    List<String> lines = new ArrayList<>();
    char[] cb = new char[TRANSFER_BUFFER_SIZE];

    int start = 0;
    int pos = 0;
    int limit = 0;
    boolean skipLF = false;
    int n;
    while ((n = read(cb, pos, cb.length - pos)) != -1) {
        limit = pos + n;
        while (pos < limit) {
            if (skipLF ) {
                if (cb[pos] == '\n') {
                    pos++;
                    start++;
                }
                skipLF = false;
                if (pos == limit)
                    break;
            }
            char c = cb[pos++];
            if (c == '\n' || c == '\r') {
                lines.add(new String(cb, start, pos - 1 - start));
                skipLF =  (c == '\r');
                start = pos;
            }
        }
        int len = limit - start;
        if (len >= cb.length) {
            // allocate larger buffer and copy chars to beginning
            char[] tmp = new char[2*cb.length];
            System.arraycopy(cb, start, tmp, 0, len);
            cb = tmp;
        } else if (start != 0 && len != 0) {
            // move fragment to beginning of buffer
            System.arraycopy(cb, start, cb, 0, len);
        }
        pos = limit = len;
        start = 0;
    }
    // add a string if EOS terminates the last line
    if (limit > start)
        lines.add(new String(cb, start, limit - start));

    return Collections.unmodifiableList(lines);
}

xuemingshen-oracle avatar Jun 28 '25 08:06 xuemingshen-oracle

The modified code below is a PoC for reference. It seems to work as expected, though not fully tested.

This looks like a nice reworking, and appears to pass the regression test, but its throughput is less than half that of the current version in the request (3e5e498), I don't as yet know why.

bplb avatar Jun 30 '25 16:06 bplb

seems i was wrong about "consolidate two nested parsing loops" :-) the throughput appears to back to normal with 'find term while loop' fast path .

    int start = 0;
    int pos = 0;
    int limit = 0;
    boolean skipLF = false;
    int n;
    while ((n = read(cb, pos, cb.length - pos)) != -1) {
        limit = pos + n;
        while (pos < limit) {
            if (skipLF) {
                if (cb[pos] == '\n') {
                    pos++;
                    start++;
                }
                skipLF = false;
            }
            while (pos < limit) {
                char c = cb[pos++];
                if (c == '\n' || c == '\r') {
                    lines.add(new String(cb, start, pos - 1 - start));
                    skipLF = (c == '\r');
                    start = pos;
                    break;
                }
            }
            if (pos == limit) {
                int len = limit - start;
                if (len >= cb.length) {
                    // allocate larger buffer and copy chars to beginning
                    char[] tmp = new char[2 * cb.length];
                    System.arraycopy(cb, start, tmp, 0, len);
                    cb = tmp;
                } else if (start != 0 && len != 0) {
                    // move fragment to beginning of buffer
                    System.arraycopy(cb, start, cb, 0, len);
                }
                pos = limit = len;
                start = 0;
                break;
            }
        }
    }
    // add a string if EOS terminates the last line
    if (limit > start)
        lines.add(new String(cb, start, limit - start));

    return Collections.unmodifiableList(lines);
}

xuemingshen-oracle avatar Jun 30 '25 18:06 xuemingshen-oracle

the throughput appears to back to normal with 'find term while loop' fast path

This version passes the regression test and is a little faster than the latest PR version. I'll run it through the CI and probably use it in the next commit. Thanks, @xuemingshen-oracle !

bplb avatar Jun 30 '25 18:06 bplb

/reviewers 2 reviewer

Another review would be good too.

RogerRiggs avatar Jul 03 '25 13:07 RogerRiggs

@RogerRiggs The total number of required reviews for this PR (including the jcheck configuration and the last /reviewers command) is now set to 2 (with at least 2 Reviewers).

openjdk[bot] avatar Jul 03 '25 13:07 openjdk[bot]

Thanks @RogerRiggs, @xuemingshen-oracle for the approvals. I will hold off integrating until next week.

bplb avatar Jul 03 '25 18:07 bplb

/integrate

bplb avatar Jul 09 '25 16:07 bplb

Going to push as commit 6e203384f8777fc55081065b128bd2b0ba074729. Since your change was applied there have been 257 commits pushed to the master branch:

  • 6249259c8050f280fb1c489e816f09d5cd72a54b: 8361299: (bf) CharBuffer.getChars(int,int,char[],int) violates pre-existing specification
  • a41d35073ee6da0dde4dd731c1ab4c25245d075a: 8357473: Compilation spike leaves many CompileTasks in free list
  • 7daf9813c0617ea97d95bf326eac1758e40cddd6: 8346884: Add since checker test to jdk.editpad
  • ... and 254 more: https://git.openjdk.org/jdk/compare/96070212adfd15acd99edf6e180db6228ee7b4ff...master

Your commit was automatically rebased without conflicts.

openjdk[bot] avatar Jul 09 '25 16:07 openjdk[bot]

@bplb Pushed as commit 6e203384f8777fc55081065b128bd2b0ba074729.

:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

openjdk[bot] avatar Jul 09 '25 16:07 openjdk[bot]