Non-deterministic premature EOF
Story
I have a service that is
- listening to a stream of request bodies
- querying a third party http server using blazeclient
- forwarding the responses further down the stream
90% of the requests are fine while 10% fails with org.http4s.InvalidBodyException: Received premature EOF,
where the 10% is not tied to particular requests,
so if I retry the same requests there's 90% chance they'd pass.
Reproduction
I managed to reproduce the issue in a controlled environment.
-
I emulate the stream of request bodies, by keep emmiting a single static request payload
val body = "x".repeat(requestPayloadSize)at a fixed rateStream.fixedRate -
query the local test server
val req = Request[IO](POST, uri).withEntity(body)...simpleClient.stream.flatMap(c => c.stream(req)).flatMap(_.bodyText)that responds with a static payloadval response = "x".repeat(responsePayloadSize)...case POST -> Root => Ok(response) -
finally I print the index of the request along with the chunk size
.evalMap(c => IO.delay(println(s"$i ${c.size}")))
Conclusion
Based on some extended experiment I managed to find some magic numbers for request and response payload sizes. Below these payload sizes I can run the app for an extended period without any exceptions, while if both request and response sizes reach these thresholds the client will eventually throw an EOF exception.
// Numbers below vary on different computers
// In my case, if the request payload size is 32603 or greater
// AND response payload size is 81161 or greater
// then we get EOF exception in some but not all cases
// If however either of these payload sizes is lower then
// EOF exception doesn't occur, even if running for an extended period
Notes
You can find the test project at https://github.com/slve/http4s-eof, there the only scala source in https://github.com/slve/http4s-eof/blob/master/src/main/scala/Http4sEof.scala.
Using fs2 2.5.0, http4s 0.21.18. https://github.com/slve/http4s-eof/blob/master/build.sbt
The server part is only there to aid the testing, but regardless of what server you'd run your test with, the client will eventually drop an EOF exception in a short period.
Thanks for the reproducible case. This is an excellent report.
I sent a PR: https://github.com/slve/http4s-eof/pull/1. I think this clears up by sharing one client for the app instead of creating a client per request.
Thanks for the quick response @rossabaker!
I've tested it side-by-side with my original version where I've inadvertently created a client on each request and it definitely made a change. I merged your PR.
I also quickly further tested your fixed version, only increasing the payload sizes from ~32kB and ~81kB both to 500kB and managed to get the same EOF error message, will get back to the topic in a bit.
EDIT: I've fine tuned the thresholds and opened a PR with my changes https://github.com/rossabaker/http4s-eof/pull/1/files?w=1
Okay, that was the only misuse I saw reviewing it last night. Going to give this the bug label and try to chase it more this weekend.
Awesome, thank you @rossabaker. Just to note, I've had other two experiments in different branches, one is using http4s/jdk-http-client while the other is using softwaremill/sttp and ran into similar issues.
Hi @rossabaker
We're facing EOF error in one of our services and I'm using the project provided by @slve to reproduce the issue. Please see the findings below:
- With blaze client, if request size is bigger than 65397 bytes and response is about 1Mb first request fails with
EOF/Broken Pipe/Connection reset by peerregardless of time the app was running, it happens almost immediately. Number 65397 is65535 - default request headers length. I'm testing it with 70000 bytes request size and 1000000 response. - With turned on tracing logs and debugging I was able to figure out that request itself streams fine, but exception occurs when streaming back response. See point 3 below.
- It seems like there is some sort of race in Http1Stage.scala lines 219-230. If I set a breakpoint on line
currentBuffer = BufferTools.concatBuffers(currentBuffer, b)simulating delay then test passes. Without breakpoint fails almost immediately intocb(eofCondition()). Prior falling intoeofConditionHttp1Stage.drainBodyfunction is called and logsHTTP body not read to completion. Dropping connection.
channelRead().onComplete {
case Success(b) =>
currentBuffer = BufferTools.concatBuffers(currentBuffer, b)
go()
case Failure(Command.EOF) =>
cb(eofCondition())
case Failure(t) =>
logger.error(t)("Unexpected error reading body.")
cb(Either.left(t))
}
Could you please take a look at the issue again and provide some updates or estimates on how soon it could be fixed?
Thanks in advance!