blaze Non-deterministic premature EOF

Story

I have a service that is

listening to a stream of request bodies
querying a third party http server using blazeclient
forwarding the responses further down the stream

90% of the requests are fine while 10% fails with org.http4s.InvalidBodyException: Received premature EOF, where the 10% is not tied to particular requests, so if I retry the same requests there's 90% chance they'd pass.

Reproduction

I managed to reproduce the issue in a controlled environment.

I emulate the stream of request bodies, by keep emmiting a single static request payload val body = "x".repeat(requestPayloadSize) at a fixed rate Stream.fixedRate
query the local test server val req = Request[IO](POST, uri).withEntity(body) ... simpleClient.stream.flatMap(c => c.stream(req)).flatMap(_.bodyText) that responds with a static payload val response = "x".repeat(responsePayloadSize) ... case POST -> Root => Ok(response)
finally I print the index of the request along with the chunk size .evalMap(c => IO.delay(println(s"$i ${c.size}")))

Conclusion

Based on some extended experiment I managed to find some magic numbers for request and response payload sizes. Below these payload sizes I can run the app for an extended period without any exceptions, while if both request and response sizes reach these thresholds the client will eventually throw an EOF exception.

  // Numbers below vary on different computers
  // In my case, if the request payload size is 32603 or greater
  //  AND response payload size is 81161 or greater
  //  then we get EOF exception in some but not all cases
  // If however either of these payload sizes is lower then
  //  EOF exception doesn't occur, even if running for an extended period

Notes

You can find the test project at https://github.com/slve/http4s-eof, there the only scala source in https://github.com/slve/http4s-eof/blob/master/src/main/scala/Http4sEof.scala.

Using fs2 2.5.0, http4s 0.21.18. https://github.com/slve/http4s-eof/blob/master/build.sbt

The server part is only there to aid the testing, but regardless of what server you'd run your test with, the client will eventually drop an EOF exception in a short period.

Feb 11 '21 16:02 slve

Thanks for the reproducible case. This is an excellent report.

I sent a PR: https://github.com/slve/http4s-eof/pull/1. I think this clears up by sharing one client for the app instead of creating a client per request.

Feb 12 '21 05:02 rossabaker

Thanks for the quick response @rossabaker!

I've tested it side-by-side with my original version where I've inadvertently created a client on each request and it definitely made a change. I merged your PR.

I also quickly further tested your fixed version, only increasing the payload sizes from ~32kB and ~81kB both to 500kB and managed to get the same EOF error message, will get back to the topic in a bit.

EDIT: I've fine tuned the thresholds and opened a PR with my changes https://github.com/rossabaker/http4s-eof/pull/1/files?w=1

Feb 12 '21 07:02 slve

Okay, that was the only misuse I saw reviewing it last night. Going to give this the bug label and try to chase it more this weekend.

Feb 12 '21 14:02 rossabaker

Awesome, thank you @rossabaker. Just to note, I've had other two experiments in different branches, one is using http4s/jdk-http-client while the other is using softwaremill/sttp and ran into similar issues.

Feb 12 '21 15:02 slve

Hi @rossabaker

We're facing EOF error in one of our services and I'm using the project provided by @slve to reproduce the issue. Please see the findings below:

With blaze client, if request size is bigger than 65397 bytes and response is about 1Mb first request fails with EOF/Broken Pipe/Connection reset by peer regardless of time the app was running, it happens almost immediately. Number 65397 is 65535 - default request headers length. I'm testing it with 70000 bytes request size and 1000000 response.
With turned on tracing logs and debugging I was able to figure out that request itself streams fine, but exception occurs when streaming back response. See point 3 below.
It seems like there is some sort of race in Http1Stage.scala lines 219-230. If I set a breakpoint on line currentBuffer = BufferTools.concatBuffers(currentBuffer, b) simulating delay then test passes. Without breakpoint fails almost immediately into cb(eofCondition()). Prior falling into eofCondition Http1Stage.drainBody function is called and logs HTTP body not read to completion. Dropping connection.

channelRead().onComplete {
  case Success(b) =>
    currentBuffer = BufferTools.concatBuffers(currentBuffer, b)
    go()

  case Failure(Command.EOF) =>
    cb(eofCondition())

  case Failure(t) =>
    logger.error(t)("Unexpected error reading body.")
    cb(Either.left(t))
}

Could you please take a look at the issue again and provide some updates or estimates on how soon it could be fixed?

Thanks in advance!

Nov 23 '21 22:11 eugene-cheverda