grpc-java icon indicating copy to clipboard operation
grpc-java copied to clipboard

Investigate premature OutputBuffer recycling in Tomcat transport with AsyncContext

Open becomeStar opened this issue 1 month ago • 2 comments

Follow-up investigation for #12524.

Symptom

When running tests with Tomcat 10 Embedded, we encounter java.lang.IllegalStateException: The output buffer object has been recycled and is no longer associated with this facade. This indicates that the OutputBuffer is being accessed after Tomcat has recycled it.

Root Cause Analysis (Preliminary)

It seems related to Tomcat 10's RECYCLE_FACADES optimization. The facade object (specifically the underlying OutputBuffer) appears to be recycled prematurely—likely as soon as the initial servlet request handler returns—even though gRPC is still using the AsyncContext to write the response stream asynchronously.

Current Workaround

We have disabled facade recycling (setDiscardFacades(false)) in the test environment to prevent flakiness (See PR #12530).

Goal

Investigate if gRPC's usage of AsyncContext needs adjustment to be compatible with Tomcat's facade recycling logic, or if this is an issue with Tomcat's behavior in async scenarios.

References

becomeStar avatar Nov 27 '25 06:11 becomeStar

@ejona86

Hello. I've deeply investigated the root cause of the IllegalStateException (or premature recycling behavior) and I'd like to share my findings and propose a potential fix.

Root Cause Analysis: Race Condition caused by Header Size Limit

The issue manifests as a race condition when writeHeaders triggers a stream closure due to header size limits (or similar constraints).

  1. Trigger: When writeHeaders calls flush(), the underlying container (Tomcat) detects that the headers exceed the size limit.
  2. Container Action: Tomcat immediately closes the stream and recycles the OutputBuffer / Response object.
  3. Race Condition: Inside AsyncServletOutputStreamWriter.runOrBuffer, immediately after the flush action runs, the code attempts to check isReady(). If the container recycles the buffer in between these lines, the isReady() check fails.

Relevant Code in AsyncServletOutputStreamWriter.java:

private void runOrBuffer(ActionItem actionItem) throws IOException {
  WriteState curState = writeState.get();
  if (curState.readyAndDrained) {
    actionItem.run(); // <--- 1. flush() executes. Tomcat may detect error and recycle OutputBuffer here.
    
    // <--- RACE WINDOW: If Tomcat recycles the buffer here...

    if (actionItem == completeAction) {
      return;
    }
    if (!isReady.getAsBoolean()) { // <--- 2. Accessing recycled OutputBuffer here throws IllegalStateException
      // ...
    }
  }
  // ...
}

To verify this, I inserted a Thread.sleep() right before the isReady() check, and the IllegalStateException was reliably reproduced. This confirms that the crash is caused by accessing a recycled object (via isReady()) due to a race between the container's error handling (recycling) and the isReady() check.

Proposed Solution

Since controlling the container's internal recycling timing is difficult (and setDiscardFacades(false) acts more as a workaround), I would like to suggest handling this gracefully within ServletServerStream.

We could catch the exceptions (IllegalStateException) that occur when accessing recycled objects during the flush operation. If such an exception occurs, it likely indicates that the stream is no longer valid, so we could treat it as a stream cancellation.

Proposed Change in ServletServerStream.java:

@Override
public void writeHeaders(Metadata headers, boolean flush) {
  writeHeadersToServletResponse(headers);
  resp.setTrailerFields(trailerSupplier);
  try {
    writer.flush();
  } catch (IOException e) {
    logger.log(WARNING, String.format("[{%s}] Exception when flushBuffer", logId), e);
    cancel(Status.fromThrowable(e));
  } catch (IllegalStateException e) {
    // Handle the case where the container (e.g. Tomcat) recycles the response 
    // concurrently (e.g. due to header size limit), causing IllegalStateException.
    logger.log(WARNING, String.format("[{%s}] Runtime exception during flush (likely recycled)", logId), e);
    cancel(Status.fromThrowable(e));
  }
}

This approach seems to make the transport more robust against container lifecycle events without relying on container-specific settings like discardFacades.

If you agree with this direction, I would be happy to submit a PR to implement this fix and potentially remove the setDiscardFacades(false) workaround from the tests. I'd appreciate your thoughts on this approach.

becomeStar avatar Dec 09 '25 02:12 becomeStar

After further testing, I found that updating only writeHeaders() is not sufficient.

Even if we catch IllegalStateException during the header flush(), the container (Tomcat) may already have recycled the Response / OutputBuffer. After that point, any subsequent write path in ServletServerStream (writeFrame()) still attempts to access the recycled servlet objects.

When this occurs, Tomcat throws either IllegalStateException or NullPointerException, depending on which internal field was cleared during recycling. In writeFrame(), we observed NullPointerException when accessing recycled objects, so we catch that specific exception and cancel the stream immediately. So handling only IllegalStateException inside writeHeaders() cannot prevent later failures.

To illustrate this, the following code (from ServletServerStream) also needed protection. Without this, tests continue to fail:

@Override
public void writeFrame(@Nullable WritableBuffer frame, boolean flush, int numMessages) {
  if (frame == null && !flush) {
    return;
  }

  try {
    if (frame != null) {
      int numBytes = frame.readableBytes();
      if (numBytes > 0) {
        onSendingBytes(numBytes);
      }
      writer.writeBytes(((ByteArrayWritableBuffer) frame).bytes, frame.readableBytes());
    }

    if (flush) {
      writer.flush();
    }
  } catch (IOException e) {
    logger.log(WARNING, String.format("[{%s}] Exception writing message", logId), e);
    cancel(Status.fromThrowable(e));
  } catch (NullPointerException e) {
    logger.log(
        WARNING,
        String.format("[{%s}] Runtime exception during flush (likely recycled)", logId),
        e);
    cancel(Status.fromThrowable(e));
  }
}

Because container-side recycling can happen between the header flush and later frame writes, a broader fix is required:

  • Catch the recycling-related exception (IllegalStateException or NullPointerException).
  • Log a warning and cancel the stream immediately (cancel(Status.fromThrowable(e))).
  • This effectively stops subsequent operations and prevents the test from crashing.

This prevents subsequent calls (e.g., writeFrame()) from crashing and allows the tests to pass consistently.

I can prepare a PR implementing this broader fix if this direction seems reasonable.

becomeStar avatar Dec 09 '25 12:12 becomeStar

Container Action: Tomcat immediately closes the stream and recycles the OutputBuffer / Response object.

Tomcat triggering IllegalStateException in this case is broken. IllegalStateException means the caller did something wrong. There was nothing wrong here; Tomcat corrupted its own state. This is a Tomcat bug. We should recommend that people use setDiscardFacades(false) until Tomcat isn't buggy itself.

I'd accept gRPC could do something if there is a signal when the header size problem happens that gRPC could observe to update its state. But it sounds like that isn't the case. I wonder if cancellation is also broken it Tomcat with façades.

ejona86 avatar Dec 22 '25 15:12 ejona86

@ejona86

Thank you for the clear guidance. I took another look at the Tomcat source code, and as you suspected, there is still no explicit error signal or lifecycle flag other than internal fields (such as OutputBuffer) being set to null during recycling.

I also agree that this kind of premature recycling could occur in scenarios beyond header size limits, which makes handling specific exceptions or adding null checks in grpc-java a “whack-a-mole” approach rather than a fundamental fix.

Catching IllegalStateException or relying on null checks might address the immediate symptom, but it risks masking container-side bugs and coupling grpc-java to Tomcat’s internal lifecycle behavior.

Given that, I agree this is fundamentally a Tomcat-side issue, and keeping the setDiscardFacades(false) workaround seems like the most reasonable and least risky path forward until Tomcat provides a clearer signal.

becomeStar avatar Dec 25 '25 07:12 becomeStar