
Use of ThreadLocal storage in HTTPServer seems to result in inefficient memory allocation

Open gvisco opened this issue 2 years ago • 15 comments

Disclaimer

I might be missing something in the design of the class (I'm sorry if this is the case), and yet I think it's worth checking and possibly suggesting alternative implementations.

TL;DR

The class HTTPServer in client_java/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java seems to allocate more memory than what is actually needed, and for too long.

Description

To handle the HTTP requests, the class relies on a ByteArrayOutputStream kept in ThreadLocal storage, which is:

  • as big as the serialized reply, and allocated 5 times, once per thread in the pool
  • overwritten for each incoming request, so its content is never re-used
  • never de-allocated or cleaned up after the reply has been delivered

Basically, it looks like the class is implemented to avoid the instantiation of new ByteArrayOutputStream objects and, to achieve this, it pays the maximum cost in terms of memory allocation.

Code walkthrough

The following code is from the HTTPServer class as it currently appears on the master branch, version 0.12.1-SNAPSHOT. The class has a member response of type LocalByteArray, a ThreadLocal holding a ByteArrayOutputStream:

private static class LocalByteArray extends ThreadLocal<ByteArrayOutputStream> {
    @Override
    protected ByteArrayOutputStream initialValue()
    {
        return new ByteArrayOutputStream(1 << 20);
    }
}
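To make the per-thread behavior concrete, here is a small runnable sketch (my own illustration, not library code) of the same pattern: every thread in the pool that calls get() triggers the initializer once and then keeps its own 1 MiB buffer for its lifetime.

```java
import java.io.ByteArrayOutputStream;

public class ThreadLocalDemo {
    // same pattern as LocalByteArray: each thread that calls get()
    // runs the initializer once and keeps its own 1 MiB buffer
    static final ThreadLocal<ByteArrayOutputStream> BUF =
            ThreadLocal.withInitial(() -> new ByteArrayOutputStream(1 << 20));

    public static void main(String[] args) throws InterruptedException {
        int[] ids = new int[2];
        Thread t1 = new Thread(() -> ids[0] = System.identityHashCode(BUF.get()));
        Thread t2 = new Thread(() -> ids[1] = System.identityHashCode(BUF.get()));
        t1.start(); t2.start();
        t1.join(); t2.join();
        // the two threads got two distinct buffer instances
        System.out.println(ids[0] != ids[1]);
    }
}
```

With 5 pool threads, this means 5 independent buffers, each retained as long as its thread lives.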

This is how the class handles the incoming HTTP requests; I've added the comments.

public void handle(HttpExchange t) throws IOException {
    String query = t.getRequestURI().getRawQuery();
    String contextPath = t.getHttpContext().getPath();
    // take, from the thread local storage, the output stream containing the previous response
    // this is the only reference in the code to `this.response`
    ByteArrayOutputStream response = this.response.get();
    // throw away the previous response
    response.reset();
    // wrap the output stream with a writer
    OutputStreamWriter osw = new OutputStreamWriter(response, Charset.forName("UTF-8"));
    if ("/-/healthy".equals(contextPath)) {
        osw.write(HEALTHY_RESPONSE);
    } else {
        String contentType = TextFormat.chooseContentType(t.getRequestHeaders().getFirst("Accept"));
        t.getResponseHeaders().set("Content-Type", contentType);
        Predicate<String> filter = sampleNameFilterSupplier == null ? null : sampleNameFilterSupplier.get();
        filter = SampleNameFilter.restrictToNamesEqualTo(filter, parseQuery(query));
        // write the reply content into the writer
        if (filter == null) {
            TextFormat.writeFormat(contentType, osw, registry.metricFamilySamples());
        } else {
            TextFormat.writeFormat(contentType, osw, registry.filteredMetricFamilySamples(filter));
        }
    }

    // flush the writer and write the new reply into `response`
    osw.close();

    // decide whether to compress or not the data
    if (shouldUseCompression(t)) {
        t.getResponseHeaders().set("Content-Encoding", "gzip");
        t.sendResponseHeaders(HttpURLConnection.HTTP_OK, 0);
        final GZIPOutputStream os = new GZIPOutputStream(t.getResponseBody());
        try {
            // send the data from `response` to the output stream
            response.writeTo(os);
        } finally {
            os.close();
        }
    } else {
        long contentLength = response.size();
        t.getResponseHeaders().set("Content-Length", String.valueOf(contentLength));
        if (t.getRequestMethod().equals("HEAD")) {
            contentLength = -1;
        }
        t.sendResponseHeaders(HttpURLConnection.HTTP_OK, contentLength);
        // send the data from `response` to the output stream
        response.writeTo(t.getResponseBody());
    }
    t.close();
    // `response` still contains the reply content
}

Possible solutions

Option A

The simplest alternative implementation would be to get rid of the response field and instantiate a new ByteArrayOutputStream within the handle method. Basically replacing

ByteArrayOutputStream response = this.response.get();
response.reset();

with

ByteArrayOutputStream response = new ByteArrayOutputStream(1 << 20);

Resulting in:

  • PRO: lower memory allocation. We no longer keep 5 responses always in memory.
  • CON: we instantiate one new object each time the endpoint is scraped. This object, though, becomes garbage-collectable right after the method exits.

Option B

Another option came to my mind, but it seems more complex to me. Given that the method ends up writing to the OutputStream from t.getResponseBody(), another alternative could be to wrap this stream in a writer and let the method TextFormat.writeFormat write directly to it. This way, no new ByteArrayOutputStream needs to be created to handle the request. Going this way, however, the changes are not so trivial; for example, the lines

long contentLength = response.size();
t.getResponseHeaders().set("Content-Length", String.valueOf(contentLength));

rely on the ByteArrayOutputStream containing the data.

In this case:

  • PRO: same PROs as before, plus no ByteArrayOutputStream allocation
  • CON: deeper code refactoring and possible open points
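For illustration, here is a minimal runnable sketch of what Option B could look like on the plain JDK HttpServer (my own sketch, not the actual HTTPServer code; writeMetrics is a hypothetical stand-in for TextFormat.writeFormat). Passing 0 to sendResponseHeaders makes the JDK server use chunked transfer encoding, which sidesteps the Content-Length problem. One caveat, raised later in this thread: once the headers are sent, a serialization error can no longer change the status code.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class DirectStreamHandler implements HttpHandler {
    @Override
    public void handle(HttpExchange t) throws IOException {
        t.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
        // responseLength == 0 tells the JDK server to use chunked
        // transfer encoding, so no Content-Length header is required
        t.sendResponseHeaders(HttpURLConnection.HTTP_OK, 0);
        try (Writer osw = new OutputStreamWriter(t.getResponseBody(), StandardCharsets.UTF_8)) {
            writeMetrics(osw); // stands in for TextFormat.writeFormat(...)
        }
        t.close();
    }

    // placeholder for the real serialization of the registry
    private void writeMetrics(Writer w) throws IOException {
        w.write("# HELP demo_metric A placeholder metric\n");
        w.write("demo_metric 1.0\n");
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/metrics", new DirectStreamHandler());
        server.start();
        System.out.println("listening on port " + server.getAddress().getPort());
    }
}
```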

EDIT: Description improved as per comment https://github.com/prometheus/client_java/issues/703#issuecomment-930136163

gvisco avatar Sep 26 '21 14:09 gvisco

Looking at your analysis and the code...

  • as big as the serialized reply, and allocated 5 times, once per thread in the pool

I agree. The ByteArrayOutputStream will grow to the maximum size of the largest response that the thread has ever served (max number of bytes written to the ByteArrayOutputStream).

  • never re-used, since it is overwritten for each incoming request

I disagree. The ByteArrayOutputStream is reused.

  • never de-allocated or cleaned up after the reply has been delivered

I agree. By design, the ByteArrayOutputStream has no way to de-allocate memory. Calling reset() doesn't free any memory, but just resets the count of bytes in the ByteArrayOutputStream.
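This is easy to demonstrate: since the buf field of ByteArrayOutputStream is protected, a tiny subclass (my own illustration, not library code) can expose the capacity and show that reset() zeroes size() but leaves the grown buffer in place.

```java
import java.io.ByteArrayOutputStream;

// buf is protected, so a small subclass can expose the current capacity
class InspectableBuffer extends ByteArrayOutputStream {
    InspectableBuffer(int size) { super(size); }
    int capacity() { return buf.length; }
}

public class ResetDemo {
    public static void main(String[] args) {
        InspectableBuffer out = new InspectableBuffer(16);
        out.write(new byte[1024], 0, 1024); // forces the buffer to grow
        out.reset();                        // count goes back to 0 ...
        System.out.println(out.size());     // ... so size() is 0 again
        System.out.println(out.capacity()); // ... but the grown buffer stays allocated
    }
}
```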

--

The code appears to be optimized to prevent memory allocation/garbage collection at the expense of memory use. Given the usage pattern of the exporters using the code (Prometheus performing periodic endpoint scraping), this optimization makes sense to me.

--

Option A solution...

  • will increase memory allocation/garbage collection.
  • will not necessarily decrease memory usage if all 5 threads are in use. (the memory usage could be equivalent.)

Option B solution...

Removing the use of the ByteArrayOutputStream (and removing the ability to get/send Content-Length) would require...

  • chunked-transfer encoding to be implemented to use HTTP/1.1 connection semantics (i.e. Connection: keep-alive)

or

  • force the connection to be closed after every request/response (i.e. HTTP/1.0 connection semantics.)

Implementing chunked-transfer encoding would most likely still use a ByteArrayOutputStream, though the buffer (chunk) size could be smaller.

--

(I am not a maintainer)

dhoard avatar Sep 29 '21 12:09 dhoard

Thank you @dhoard for your comments.

Regarding this point

I disagree. The ByteArrayOutputStream is reused.

You are definitely right. I meant that the content of the ByteArrayOutputStream is never re-used; the ByteArrayOutputStream itself is indeed reused. I will edit the original post to avoid further misunderstanding, I hope that's ok.

The code appears to be optimized to prevent memory allocation/garbage collection at the expense of memory use. Given the usage pattern of the exporters using the code (Prometheus performing periodic endpoint scraping), this optimization makes sense to me.

Can you please elaborate on this? It's a genuine question, as my experience with Prometheus is limited. I can understand the prevention of garbage collection in principle, but I'm wondering whether, in this context, preventing GC is so critical that it justifies the permanent allocation of 5 times the size of the reply. I say this also considering that the periodic endpoint scraping happens rather infrequently (the default interval is 1m) and we would allocate just one object per scrape request, so the pressure on the GC seems pretty limited.

Regarding your comment on Option A

will not necessarily decrease memory usage if all 5 threads are in use. (the memory usage could be equivalent.)

Yes, you are right. However, even in the case of equivalent peak memory usage, this would be a temporary situation, since the allocated memory can be freed after the replies are sent. Therefore, although the maximum memory used could be the same, the average memory usage would be far smaller. If the reply size is R bytes, we would have long time intervals with no memory allocated (or with memory eligible for garbage collection) and occasional spikes of allocated memory, up to 5R bytes. In the current implementation, instead, we have 5R bytes allocated at all times.

gvisco avatar Sep 29 '21 14:09 gvisco

The LocalByteArray was introduced for gzip support if the client supports gzip compression.

I'm not sure it is a problem to keep the byte array between two polls. The memory is required for the Prometheus endpoint to work, so an application must make sure that the size of the byte array is available; otherwise the next poll will fail. So, if this chunk of memory needs to be available anyway, why not just keep it allocated?

fstab avatar Sep 29 '21 15:09 fstab

@gvisco my comments were based on my analysis of the code. I'm not advocating either way. I wouldn't expect the memory usage (as currently designed), or a refactor that creates a new ByteArrayOutputStream per request (more GCs), to have much impact in a real application.

@fstab I'm not sure I follow...

The memory is required for the Prometheus endpoint to work, so an application must make sure that the size of the Byte array is available, otherwise the next poll will fail.

Various other objects are created to service a poll of the endpoint. Even with the ByteArrayOutputStream pre-allocated, things can still fail with an OutOfMemoryError if memory is exhausted when the other objects are created.

Again, I'm not advocating either way.

dhoard avatar Sep 29 '21 15:09 dhoard

@dhoard don't get me wrong, I did not want to push you in one direction or the other; I only wanted to understand your view better and possibly explain mine better. As I said, I have limited knowledge of the scenarios where this class is used.

I'm not sure if it is a problem to keep the Byte array between two polls. The memory is required for the Prometheus endpoint to work, so an application must make sure that the size of the Byte array is available, otherwise the next poll will fail. So, if this chunk of memory needs to be available anyway, why not just keep it allocated?

@fstab my concern is that the same memory used by the Prometheus endpoint is also used by the monitored application, so keeping 5 times the response constantly in memory "steals" resources that could be needed by the application itself. Practically speaking, the chances of getting an OutOfMemory error are higher.

For the sake of clarity, on @dhoard comment

I wouldn't expect the memory usage (as currently designed) or a refactor that allocated memory/more GCs in a real application would have much impact.

I opened this issue after my team, working on a real application, decided not to use this class because of its memory consumption. We have experienced OutOfMemory errors in production before, and we would not take the risk of seeing them happen more often. Also, in this application, having one object allocated every few seconds has no real impact on GC. I understand that my case may differ from the average usage of this library (very large response size; risk of OOM; fine with additional object allocations or garbage collection), and yet I think the issue under discussion is not only theoretical, as it might affect the decision to use or not use the HTTPServer.

gvisco avatar Sep 29 '21 16:09 gvisco

@gvisco good conversation... always good to have more eyes on any code. I'm not a maintainer... just an engineer looking at the code. :smile:

dhoard avatar Sep 29 '21 18:09 dhoard

Thanks @gvisco and @dhoard, I appreciate the conversation as well. And even though I am a maintainer that doesn't mean that my opinion is always correct :)

Anyway, if your application runs out of memory because of the LocalByteArray, it will still run out of memory if we remove the LocalByteArray after each scrape and create a new one with the next scrape. You just need to be unlucky enough that your other memory-consuming tasks run at the same time as a scrape. The only way to prevent this would be to synchronize your other requests with the poll interval to make sure they do not happen at the same time. Otherwise, removing the LocalByteArray will not fix your memory issue; it will just make your application crash less often.

However, we could replace the ExecutorService with an implementation that has only 1 initial thread and maybe 5 threads as a maximum, because in most applications 1 thread should be enough, and then the other threads would never get created. That would save us the unnecessary copies of the LocalByteArray.
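A sketch of that executor change (illustrative, not the actual HTTPServer code) could look like this; with a direct hand-off queue, extra threads are only created while scrapes actually overlap, and idle extra threads are reclaimed after the keep-alive timeout.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ScrapeExecutors {
    // Start with a single thread and grow to at most 5 when scrapes
    // overlap, so in the common case only one LocalByteArray ever exists.
    public static ThreadPoolExecutor newScrapeExecutor() {
        return new ThreadPoolExecutor(
                1,                        // core pool: one thread for the common case
                5,                        // max pool: cap on concurrent scrapes
                60, TimeUnit.SECONDS,     // idle extra threads die after 60s
                new SynchronousQueue<>()  // direct hand-off: busy core thread => spawn a new one
        );
    }
}
```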

Apart from that, I don't think the original intention of keeping the LocalByteArray was reducing garbage collection. If the initial 1M capacity is not enough, the LocalByteArray will increase in size and copy the data to a larger array. Keeping it has the effect that this happens only once, because after that it remains in memory with the new capacity.

The LocalByteArray implementation sounds like an unnecessary micro-optimization to me, because as scrapes happen only every minute or so there should not be much overhead either way. However, as said, re-creating the LocalByteArray with each scrape will not solve memory issues, just make them occur less frequently.

fstab avatar Sep 29 '21 19:09 fstab

@fstab @gvisco PR https://github.com/prometheus/client_java/pull/704 to implement the changes discussed above.

dhoard avatar Sep 30 '21 00:09 dhoard

Thank you both for the answers and the detailed info and explanation :)

gvisco avatar Sep 30 '21 07:09 gvisco

I closed https://github.com/prometheus/client_java/pull/704 since it required more refactoring and added some negative behavior.

dhoard avatar Oct 06 '21 11:10 dhoard

@fstab I was thinking more about this code.

Do you have any insight as to why we buffer to a ByteArrayOutputStream and then write to the OutputStream, versus just chaining the streams to the OutputStream of HttpExchange.getResponseBody()?

dhoard avatar Jan 17 '22 19:01 dhoard

We need to know if there's an exception when generating the output to set the response code correctly.

brian-brazil avatar Jan 17 '22 21:01 brian-brazil

@brian-brazil the code never catches/handles an Exception in the HTTPMetricHandler.handle() method. It always sends HttpURLConnection.HTTP_OK.

https://github.com/prometheus/client_java/blob/c83877ab01539a34f172d72405db0f1b9b29cc60/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L110

https://github.com/prometheus/client_java/blob/c83877ab01539a34f172d72405db0f1b9b29cc60/simpleclient_httpserver/src/main/java/io/prometheus/client/exporter/HTTPServer.java#L125

dhoard avatar Jan 17 '22 21:01 dhoard

The TextFormat.writeFormat can throw exceptions, which will cause the request to fail.

brian-brazil avatar Jan 17 '22 22:01 brian-brazil

@fstab and others, I was just reviewing the Prometheus HTTP server code as well and noticed the ByteArrayOutputStream, though in 1.0.0 there is no ThreadLocal.

I'm using Jooby, which allows using a ByteBuffer instead of an OutputStream.

What I am doing, and I hope it works out, is to use locks and reuse a byte buffer.

e.g.

private final ByteBufferedOutputStream stream = new ByteBufferedOutputStream();
private final ReentrantLock lock = new ReentrantLock();
void someMetricsEndpoint(SomeExchange exchange) {
  lock.lock(); // tryLock will actually be used; plain lock() here for brevity
  try {
    stream.reset(); // set the internal byte array index back to 0
    prometheus.write(stream); // not the actual code but you get the idea
    exchange.send(stream.byteBuffer()); // the internal byte array is reused.
  }
  finally {
    lock.unlock();
  }
}

The idea is that only one scraper is hitting the server, so this prevents a whole bunch of garbage collection.

I omitted error handling and setting the content type, but I think you get the idea. Obviously you have to be aware of what the HTTP server prefers, as some only allow writing to an OutputStream (like write(byte[], int start, int end)), and whether the call is blocking or not.

Here is somewhat the ByteBufferedOutputStream (obviously remove the parts you do not need): https://github.com/jstachio/jstachio/blob/main/api/jstachio/src/main/java/io/jstach/jstachio/output/ByteBufferedOutputStream.java

If the tryLock fails, you can respond with a status code of 429 (TOO_MANY_REQUESTS).
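A minimal runnable version of that tryLock idea (names are illustrative, not Jooby or Prometheus API) could look like:

```java
import java.util.concurrent.locks.ReentrantLock;

// A second concurrent scrape gets 429 instead of blocking on the shared buffer.
public class TryLockDemo {
    final ReentrantLock lock = new ReentrantLock();

    // returns the HTTP status code a handler would send
    int handleScrape() {
        if (!lock.tryLock()) {
            return 429; // TOO_MANY_REQUESTS: a scrape is already in progress
        }
        try {
            // ... reset the shared buffer, serialize metrics, send response ...
            return 200;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(new TryLockDemo().handleScrape()); // prints 200: uncontended
    }
}
```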

agentgt avatar Oct 18 '23 19:10 agentgt