6478546: FileInputStream.read() throws OutOfMemoryError when there is plenty available
Modify the native multi-byte read-write code used by the java.io classes to limit the size of the allocated native buffer, thereby decreasing the off-heap memory footprint and increasing throughput.
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-6478546: FileInputStream.read() throws OutOfMemoryError when there is plenty available
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk pull/8235/head:pull/8235
$ git checkout pull/8235
Update a local copy of the PR:
$ git checkout pull/8235
$ git pull https://git.openjdk.org/jdk pull/8235/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 8235
View PR using the GUI difftool:
$ git pr show -t 8235
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/8235.diff
:wave: Welcome back bpb! A progress list of the required criteria for merging this PR into master
will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
Currently in java.io.FileInputStream.read(byte[],int,int) and java.io.FileOutputStream.write(byte[],int,int), for example, if the number of bytes to be read or written exceeds 8192, a native buffer of the total length of the read or write is allocated. For large reads or writes this allocation can be significant. It is proposed to limit the maximum allocation size to 1 MB and to perform the read or write by looping with this buffer. For reads or writes of less than 1 MB there is no effective change in the implementation.
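The change itself is in native code, but its loop structure can be sketched in Java terms. This is a minimal illustration only; `BoundedCopy`, `readBounded`, and `MAX_BUF_SIZE` are names invented here, not part of the patch:

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative Java rendering of the proposed bounded-buffer loop; the
// actual change lives in the native io_util.c code. MAX_BUF_SIZE is a
// hypothetical name for the 1 MB cap described above.
final class BoundedCopy {
    private static final int MAX_BUF_SIZE = 1 << 20; // 1 MB cap

    static int readBounded(InputStream in, byte[] b, int off, int len)
            throws IOException {
        int total = 0;
        while (total < len) {
            // Request at most MAX_BUF_SIZE bytes per underlying call
            int chunk = Math.min(len - total, MAX_BUF_SIZE);
            int n = in.read(b, off + total, chunk);
            if (n < 0)
                return (total == 0) ? -1 : total;
            total += n;
            if (n < chunk)   // short read: return what we have
                break;
        }
        return total;
    }
}
```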
This change both saves off-heap memory and increases throughput. An allocation of 1 MB is only 0.42% of the size of the buffer in the JBS issue, 501 x 501 x 501 x 2 (= 251,503,002) bytes, so for that case the memory reduction is drastic. Reading throughput is almost doubled and writing throughput is improved by about 50%. As measured on macOS, the throughput of the methods mentioned above before the change was:
Benchmark Mode Cnt Score Error Units
ReadWrite.read thrpt 5 10.108 ± 0.264 ops/s
ReadWrite.write thrpt 5 7.188 ± 0.431 ops/s
and that after is:
Benchmark Mode Cnt Score Error Units
ReadWrite.read thrpt 5 20.112 ± 0.262 ops/s
ReadWrite.write thrpt 5 10.644 ± 4.568 ops/s
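The benchmark source is not shown in this PR, but a minimal JMH harness along these lines would produce output of the shape above (the class body, file setup, and length used here are assumptions):

```java
import java.io.*;
import java.nio.file.*;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Minimal JMH sketch approximating the ReadWrite benchmark quoted above;
// the actual benchmark source is not included in this PR, so the setup
// and length are assumptions.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@State(Scope.Thread)
public class ReadWrite {
    // 501 x 501 x 501 x 2 bytes, the length from the JBS issue
    private static final int LENGTH = 501 * 501 * 501 * 2;

    private Path file;
    private byte[] data;

    @Setup
    public void setup() throws IOException {
        data = new byte[LENGTH];
        file = Files.createTempFile("ReadWrite", ".bin");
        Files.write(file, data);
    }

    @TearDown
    public void tearDown() throws IOException {
        Files.deleteIfExists(file);
    }

    @Benchmark
    public int read() throws IOException {
        try (FileInputStream fis = new FileInputStream(file.toFile())) {
            return fis.read(data, 0, LENGTH);
        }
    }

    @Benchmark
    public void write() throws IOException {
        try (FileOutputStream fos = new FileOutputStream(file.toFile())) {
            fos.write(data, 0, LENGTH);
        }
    }
}
```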
@bplb The following label will be automatically applied to this pull request:
- core-libs
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Webrevs
- 08: Full - Incremental (2532c0b3)
- 07: Full - Incremental (9d7e4fbf)
- 06: Full - Incremental (9a138636)
- 05: Full - Incremental (e6f3b457)
- 04: Full - Incremental (00521485)
- 03: Full - Incremental (10f14bb3)
- 02: Full - Incremental (5c3a3446)
- 01: Full - Incremental (40d46f8f)
- 00: Full (8bb1e14f)
The benchmark results are quite unexpected. Would we benefit from reducing the buffer size even further?
I tested with smaller and smaller buffer sizes until the performance started to be affected, which occurred at about 64 KB. I have not yet changed this value in the patch.
By the way: FileOutputStream has exactly the same problem with write(byte[]). I see no test for it, but I assume this is now also fixed. That's a longstanding issue in Lucene (this is why we use a ChunkedOutputStream when writing files).
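A chunking wrapper of the kind described might look like the following sketch (the class name and the 8192-byte chunk size are illustrative; Lucene's actual implementation may differ):

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hedged sketch of a chunking workaround of the kind described above;
// the name and chunk size are illustrative, not Lucene's actual code.
public class ChunkedOutputStream extends FilterOutputStream {
    private static final int CHUNK = 8192;

    public ChunkedOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Split one large write into bounded writes so the underlying
        // stream never sees a huge array in a single call
        while (len > 0) {
            int n = Math.min(len, CHUNK);
            out.write(b, off, n);
            off += n;
            len -= n;
        }
    }
}
```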
Yes, write(byte[]) is also fixed: io_util.c#L164.
Further performance testing was conducted for the case where the native read and write functions used a fixed, stack-allocated buffer of size 8192 and the loops were moved up into the Java code of FileInputStream, FileOutputStream, and RandomAccessFile. Note that this duplicated code, because RandomAccessFile needs both read and write methods as well. The performance of writing with this approach was approximately half what it had been, so for writing the approach was abandoned.
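For reference, the abandoned Java-side variant would have had roughly this shape (a sketch only; read0 and STACK_BUF_SIZE are hypothetical names standing in for the actual private native methods):

```java
// Hedged sketch of the abandoned "loop in Java" variant: the native method
// is limited to a fixed 8192-byte stack buffer, and Java supplies the loop.
// read0 is a hypothetical name; the actual private native methods differ.
private static final int STACK_BUF_SIZE = 8192;

// Reads at most STACK_BUF_SIZE bytes using a stack-allocated native buffer
private native int read0(byte[] b, int off, int len) throws IOException;

public int read(byte[] b, int off, int len) throws IOException {
    int total = 0;
    while (total < len) {
        int chunk = Math.min(len - total, STACK_BUF_SIZE);
        int n = read0(b, off + total, chunk);
        if (n < 0)
            return (total == 0) ? -1 : total;
        total += n;
        if (n < chunk)   // short read from the native layer
            break;
    }
    return total;
}
```

In this shape every 8192-byte chunk pays a JNI transition, which is consistent with the degraded write throughput observed.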
Here are some updated performance measurements:
[Charts omitted: read(byte[]) and write(byte[]) throughput (ops/s, log10 scale) versus length for the Master, Java, and Native variants]
The performance measurements shown are for the following cases:
- Master: unmodified code as it exists in the mainline
- Java: fixed-size stack buffer in native read, read loops in Java, write as in the mainline but with malloc buffer size limit
- Native: read loop in native read with malloc buffer size limit, write as in the mainline but with malloc buffer size limit
The horizontal axis represents a range of lengths from 8192 bytes to 1 GB; the vertical axis is throughput (ops/s) on a log10 scale. The Native lines in the charts are for the code proposed to be integrated.
As can be seen, reading performance is quite similar across the variants until the larger lengths, where the mainline version presumably starts to suffer the effect of large allocations. The native read loop performs best throughout, being from 50% to 3x faster than the mainline version for lengths of 10 MB and above, and about 40% faster than the Java read loop at those lengths.
Due to the log scale of the charts, the reading performance detail cannot be seen exactly, so it is given here for the larger lengths:
Throughput of read(byte[]) (ops/s)
Length Master Java Native
1048576 11341.39 6124.482 11371.091
10485760 356.893 376.326 557.906
251503002 10.036 14.27 19.869
524288000 5.005 6.857 9.552
1000000000 1.675 3.527 4.997
The performance of writing is about the same for the Java and Native versions, as it should be since the implementations are the same. Any difference is likely due to measurement noise. The mainline version again suffers for larger lengths.
As the native write loop was already present in the mainline code, the principal complexity proposed to be added is the native read loop. Given the improved throughput and vastly reduced native memory allocation this seems to be justified.
This change would increase throughput at the expense of the additional complexity of a new loop in the native readBytes() function; a loop was already present in the native writeBytes() function. Putting the loop in Java was problematic and yielded a smaller performance improvement. I suggest at this point either refining the current proposal as needed to move it forward, or withdrawing the request and resolving the associated issue as Won't Fix.
The following benchmark results are for commit e6f3b4574b1c49955e5acaca618a3083eb132e05 with 5 warmup iterations of 5 seconds each, followed by 10 measurement iterations of 10 seconds each on a MacBookPro16,1.
FileInputStream.read(byte[])
Score (ops/s)
(length) Master Branch
16384 5850.383 ± 220.209 5805.952 ± 187.870
32768 5781.410 ± 345.939 5807.587 ± 515.808
100000 5444.167 ± 409.680 5273.988 ± 159.652
500000 3748.413 ± 88.505 3803.577 ± 256.442
1000000 2647.356 ± 145.245 2700.523 ± 97.717
1048576 2759.337 ± 362.618 2631.779 ± 45.971
10485760 338.840 ± 2.706 392.126 ± 2.59
251503002 10.626 ± 0.024 20.165 ± 0.091
524288000 5.134 ± 0.024 9.826 ± 0.068
1000000000 1.806 ± 0.014 5.194 ± 0.015
FileOutputStream.write(byte[])
Score (ops/s)
(length) Master Branch
16384 2736.489 ± 31.547 2599.203 ± 79.392
32768 2656.471 ± 72.597 2523.470 ± 158.460
100000 2476.514 ± 46.392 2225.597 ± 442.285
500000 1606.205 ± 487.787 1281.550 ± 413.797
1000000 1125.265 ± 13.969 812.184 ± 232.23
1048576 1098.608 ± 13.528 816.657 ± 218.112
10485760 136.867 ± 2.310 153.137 ± 2.93
251503002 7.266 ± 0.026 10.546 ± 0.125
524288000 3.450 ± 0.016 3.880 ± 1.355
1000000000 1.178 ± 0.007 2.637 ± 0.505
The proposed change is approximately at parity with the current code base for smaller lengths, but has higher throughput for larger lengths, at least for reading. In all cases where the length is greater than or equal to 65536 (64 KiB) or is a multiple of 8192 (8 KiB), less native memory is allocated.
Using the temporary direct buffer cache to provide intermediate native memory is not a good solution, as direct memory may be limited. Hence the patch is reverted to its prior state, modified to move parameter checking up to the Java layer, and simplified to keep Blocker use within the readBytes and writeBytes methods. This patch uses less memory and has generally higher throughput than the master branch. Sample benchmark measurements on macOS are as follows:
RandomAccessFile::read(byte[])
ops/s
length master patch
16384 468119.718 501174.185
32768 328305.094 353844.578
262144 48575.747 52739.050
1048576 13176.250 13723.486
10485760 378.469 596.246
251503002 10.554 20.784
524288000 5.095 10.014
1000000000 1.743 5.254
RandomAccessFile::write(byte[])
ops/s
length master patch
16384 328584.712 353691.078
32768 252836.010 276133.188
262144 46072.526 38558.946
1048576 12908.426 9716.200
10485760 194.396 498.384
251503002 7.649 18.175
524288000 3.643 8.524
1000000000 1.202 4.491
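For context, confining the Blocker bookkeeping to the two methods might look like this sketch (assuming the begin()/end(long) pairing of the JDK-internal jdk.internal.misc.Blocker class as it existed at the time; readBytes0 and writeBytes0 are hypothetical native method names):

```java
// Sketch of keeping Blocker use within readBytes/writeBytes, assuming the
// begin()/end(long) pairing of the JDK-internal jdk.internal.misc.Blocker
// class at the time of this PR; the native method names are hypothetical.
private int readBytes(byte[] b, int off, int len) throws IOException {
    long comp = Blocker.begin();      // note the start of a blocking call
    try {
        return readBytes0(b, off, len);
    } finally {
        Blocker.end(comp);            // compensate when the call completes
    }
}

private void writeBytes(byte[] b, int off, int len) throws IOException {
    long comp = Blocker.begin();
    try {
        writeBytes0(b, off, len);
    } finally {
        Blocker.end(comp);
    }
}
```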
@bplb This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
Moving to draft.
@bplb This pull request has been inactive for more than 8 weeks and will be automatically closed if another 8 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!
@bplb This pull request has been inactive for more than 16 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it! This can be done using the /open pull request command.