Investigation into the reduction of runtime and memory usage for `StreamGobbler`
What feature(s) would you like to see in RepoSense
As detailed in issue #2091, the `StreamGobbler` class consumes a large amount of memory when in use, upwards of 500 MB per run.
After digging through the codebase and the source code of the `String` and `StringBuilder` classes, it appears that there may be some performance bottlenecks in the way the code is currently written.
Currently, the code is implemented as follows:
```java
ReadableByteChannel ch = Channels.newChannel(is);
int len;
while ((len = ch.read(buffer)) > 0) {
    sb.append(new String(buffer.array(), 0, len));
    buffer.rewind();
}
value = sb.toString();
```
We can observe that a new `String` is created for every 8 KB of data read into the buffer, and that this string is then appended to the others accumulated in the `StringBuilder` before the buffer is rewound and overwritten by the next read.
After reading through the `String` API, I noticed that creating a new string from the buffer array likely copies the array with `Arrays::copyOf` or `Arrays::copyOfRange` inside the `StringCoding` class, which handles the decoding of `String` objects.
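A minimal standalone demonstration (my own example, not code from `StreamGobbler`) of that defensive copy: mutating the byte array after constructing the string does not affect the string, because the constructor copied the bytes.

```java
import java.nio.charset.StandardCharsets;

// Shows that new String(byte[], ...) takes a defensive copy of the buffer.
public class StringCopyDemo {
    public static void main(String[] args) {
        byte[] data = "abc".getBytes(StandardCharsets.US_ASCII);
        String s = new String(data, 0, data.length, StandardCharsets.US_ASCII);
        data[0] = 'z'; // mutate the original buffer after construction
        System.out.println(s); // still prints "abc": the bytes were copied
    }
}
```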
Moreover, appending a `String` to a `StringBuilder` may call the `AbstractStringBuilder::getBytes` method, which in turn calls `System::arraycopy`. Taken together, these calls mean the same data may be copied twice: first from the byte buffer into the byte array inside a `String`, and then from that array into the internal byte array of the `StringBuilder`.
This repeated work, together with the creation of many `String` objects (problematic for huge files, since each `String` holds at most 8 KB of file data), could significantly degrade runtime performance (partly through garbage collection pressure) and increase heap memory usage.
We could look into finding new ways to read all data in an input stream and avoid repeated work to improve both runtime and memory performance.
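One possible direction, sketched below as an illustration only (the helper name `readAll` is mine, and this has not been benchmarked or verified against RepoSense's tests): accumulate the raw bytes in a `ByteArrayOutputStream` and decode them into a `String` once at the end, avoiding one `String` allocation per 8 KB chunk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadAllSketch {
    // Collect all bytes first, then decode once.
    static String readAll(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = is.read(buffer)) > 0) {
            out.write(buffer, 0, len); // no String created per chunk
        }
        return out.toString(StandardCharsets.UTF_8.name());
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream(
                "hello world".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(is)); // prints "hello world"
    }
}
```

Note that this still holds the full byte content in memory at once, so it trades per-chunk allocations for one large buffer; whether that is a net win here would need profiling.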
Is the feature request related to a problem?
This issue is not related to a problem, but it is related to the overall goal of making RepoSense more performant.
If possible, describe the solution
Currently, I am unable to find a solution that works sufficiently well: so far, every approach that improves memory usage degrades runtime performance, and vice versa.
Some resources that we might wish to take a look at would be this. I have tried using `BufferedReader` from the guide, and it seems to reduce overall runtime and memory usage, but it occasionally fails some test cases and system test cases.
In one of the profiling runs, the overall runtime and memory usage were lower than with the improvements made in #2091.
The code tested is as follows:
```java
StringBuilder sb = new StringBuilder();
try (BufferedReader streamReader = new BufferedReader(new InputStreamReader(is))) {
    int c;
    while ((c = streamReader.read()) != -1) {
        sb.append((char) c);
    }
    value = sb.toString();
}
```
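One unverified hypothesis for the occasional test failures: `new InputStreamReader(is)` decodes with the platform-default charset, which can differ between machines. The sketch below (the class and method names are mine, for illustration) pins UTF-8 explicitly and reads in chunks rather than one `char` at a time; whether this actually fixes the failing tests would need to be confirmed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class BufferedReadSketch {
    static String readAll(InputStream is) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8))) { // explicit charset
            char[] chunk = new char[8192];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n); // chunked append instead of per-char
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new ByteArrayInputStream(
                "héllo".getBytes(StandardCharsets.UTF_8)))); // prints "héllo"
    }
}
```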
If applicable, describe alternatives you've considered
Currently, no other alternatives have been considered.
Additional context
N/A