Investigation into the reduction of runtime and memory usage for `StreamGobbler`
What feature(s) would you like to see in RepoSense
As detailed in issue #2091, the `StreamGobbler` class consumes a large amount of memory when in use, upwards of 500 MB per run.
After digging through the codebase and the source code of the `String` and `StringBuilder` classes, it appears that there may be some performance bottlenecks in the way the code is currently written.
Currently, the code is implemented as follows:
```java
ReadableByteChannel ch = Channels.newChannel(is);
int len;
while ((len = ch.read(buffer)) > 0) {
    sb.append(new String(buffer.array(), 0, len));
    buffer.rewind();
}
value = sb.toString();
```
We can observe that a new `String` is created for every 8 KB of data read into the buffer, and that this string is then appended to the others accumulated in the `StringBuilder` before the buffer is rewound and overwritten by the next read.
After reading through the `String` API, I noticed that creating a new string from the buffer array likely copies the array with `Arrays::copyOf` or `Arrays::copyOfRange` inside the `StringCoding` class, which handles the decoding of `String` objects.
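A minimal standalone demonstration (my own example, not code from `StreamGobbler`) of that defensive copy: mutating the byte array after constructing the string does not affect the string, because the constructor copied the bytes.

```java
import java.nio.charset.StandardCharsets;

// Shows that new String(byte[], ...) takes a defensive copy of the buffer.
public class StringCopyDemo {
    public static void main(String[] args) {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        String s = new String(data, 0, data.length, StandardCharsets.US_ASCII);
        data[0] = 'z'; // mutate the original buffer after construction
        System.out.println(s); // still prints "abc": the bytes were copied
    }
}
```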
Moreover, appending a `String` to a `StringBuilder` may call the `AbstractStringBuilder::getBytes` method, which in turn calls `System::arraycopy`. Taken together, these calls mean the same data may be copied twice: first from the byte buffer into the byte array inside a `String`, and then from that array into the internal byte array of the `StringBuilder`.
This repeated work, together with the creation of many `String` objects (problematic for huge files, since each `String` holds at most 8 KB of file data), could significantly degrade runtime performance (partly through garbage collection pressure) and increase heap memory usage.
We could look into finding new ways to read all data in an input stream and avoid repeated work to improve both runtime and memory performance.
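One possible direction, sketched below as an illustration only (the helper name `readAll` is mine, and this has not been benchmarked or verified against RepoSense's tests): accumulate the raw bytes in a `ByteArrayOutputStream` and decode them into a `String` once at the end, avoiding one `String` allocation per 8 KB chunk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadAllSketch {
    // Collect all bytes first, then decode once.
    static String readAll(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = is.read(buffer)) > 0) {
            out.write(buffer, 0, len); // no String created per chunk
        }
        return out.toString(StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream(
                "hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(is)); // prints "hello world"
    }
}
```

Note that this still holds the full byte content in memory at once, so it trades per-chunk allocations for one large buffer; whether that is a net win here would need profiling.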
Is the feature request related to a problem?
This issue is not related to a problem, but it is related to the overall goal of making RepoSense more performant.
If possible, describe the solution
Currently, I am unable to find a solution that works sufficiently well: so far, every approach that improves memory usage degrades runtime performance, and vice versa.
Some resources that we might wish to take a look at would be this. I have tried using `BufferedReader` from the guide, and it seems to reduce overall runtime and memory usage, but it occasionally fails some test cases and system test cases.
In one of the profiling runs, the overall runtime and memory usage were lower than with the improvements made in #2091.
The code tested is as follows:
```java
StringBuilder sb = new StringBuilder();
try (BufferedReader streamReader = new BufferedReader(new InputStreamReader(is))) {
    int c;
    while ((c = streamReader.read()) != -1) {
        sb.append((char) c);
    }
    value = sb.toString();
}
```
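One unverified hypothesis for the occasional test failures: `new InputStreamReader(is)` decodes with the platform-default charset, which can differ between machines. The sketch below (the class and method names are mine, for illustration) pins UTF-8 explicitly and reads in chunks rather than one `char` at a time; whether this actually fixes the failing tests would need to be confirmed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class BufferedReadSketch {
    static String readAll(InputStream is) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8))) { // explicit charset
            char[] chunk = new char[8192];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n); // chunked append instead of per-char
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new ByteArrayInputStream(
                "héllo".getBytes(StandardCharsets.UTF_8)))); // prints "héllo"
    }
}
```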
If applicable, describe alternatives you've considered
Currently, no other alternatives have been considered.
Additional context
N/A