
Wrong stack trace with timeout

Open geirolz opened this issue 3 years ago • 9 comments

Given the following code, I get a TimeoutException for 1500 millis, which is correct, BUT the stack trace takes me to the line of t2 instead of t3. I was expecting the stack trace to point to the timeout that was exceeded.

val t1 = IO.sleep(1.seconds)
val t2 = IO.sleep(2.seconds).timeout(5.seconds)
val t3 = (t1 >> t2).timeout(1500.millis)
t3.unsafeRunSync() // TimeoutException at timeout Line 2

Whereas if I add a timeout to t1, I get the same exception, but the stack trace brings me to t1.

val t1 = IO.sleep(1.seconds).timeout(5.seconds)
val t2 = IO.sleep(2.seconds).timeout(5.seconds)
val t3 = (t1 >> t2).timeout(1500.millis)
t3.unsafeRunSync() // TimeoutException at timeout Line 1

If there are no nested timeouts, I get the right stack trace.

val t1 = IO.sleep(1.seconds)
val t2 = IO.sleep(2.seconds)
val t3 = (t1 >> t2).timeout(1500.millis)
t3.unsafeRunSync() // TimeoutException at timeout Line 3

The issue persists with every cats.effect.tracing.mode setting.

geirolz avatar Jan 10 '22 10:01 geirolz

This is very interesting! Can definitely confirm this is a bug.

djspiewak avatar Jan 16 '22 04:01 djspiewak

Okay, I figured this out. It has to do with the way that we're keying traces. full tracing mode should have avoided the issue, but that appears to have some other issues that need ironing out.

The problem is that the first one wins on determining the call site for the flatMap inside of timeout. Anything starting with the cats. package is filtered out, which is why the trace lists your source code as being the caller. When you only have the one timeout call, the call site is correct (relative to expectations), and so you see the right thing in the trace (albeit with some repeated frames due to the aforementioned filtering). With multiple timeouts, you'll end up getting the very first as the trace value for the flatMap inside of timeout (since obviously that flatMap is fully shared), meaning that all timeouts will be traced back to that very first evaluation call site, which is precisely why the stack trace is so weird.

This is a bit annoying. Under normal circumstances, we wouldn't be trying to trace timeout at all, it's just the fact that it happens to be derived from flatMap and race which makes it in any way traceable, but of course it's the same flatMap every time, resulting in the unintuitive attribution you reproduced. Going to give some thought to this.
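
To make that concrete, here is a rough sketch of the derivation (simplified for illustration, not the exact cats-effect source); the key point is that the flatMap lives inside the cats.effect package and is the same node for every timeout call:

import java.util.concurrent.TimeoutException
import scala.concurrent.duration.FiniteDuration
import cats.effect.IO

// Roughly how timeout is derived (a sketch, not the real implementation):
def timeoutSketch[A](ioa: IO[A], duration: FiniteDuration): IO[A] =
  IO.race(ioa, IO.sleep(duration)).flatMap {
    // This flatMap is shared by every timeout invocation, so cached tracing
    // records one call site for it: the first caller wins, and because all
    // cats.* frames are filtered out, that first user call site is what every
    // subsequent timeout reports.
    case Left(a)  => IO.pure(a)
    case Right(_) => IO.raiseError(new TimeoutException(duration.toString))
  }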

djspiewak avatar Jan 16 '22 04:01 djspiewak

Sorry for the late reply @djspiewak, and thank you for your detailed explanation. The real problem is that when you have several timeouts with the same duration, it can be difficult to understand which one expired (this is the problem we had in prod and the reason behind this ticket).

geirolz avatar Feb 01 '22 20:02 geirolz

I expected as much, but one thing kind of puzzles me: even if the last trace frame is deceptive, wouldn't the previous trace frames be a little more illustrative? That doesn't hold true in the OP example, but a production codebase is unlikely to just consist of a whole bunch of timeout calls.

djspiewak avatar Feb 01 '22 21:02 djspiewak

In our case we have an engine that executes some steps (like the saga orchestrator pattern). Each step has a max timeout in the engine, while each step's action is defined as an IO, so in some cases we have a timeout inside the step action.

That step action is sometimes a composition of multiple IOs, each of which could have its own inner timeout. So, in case of a timeout exception, we get a stack trace that points us to the first timeout, which isn't the one that threw the exception.

geirolz avatar Feb 01 '22 23:02 geirolz

So as a quick note, one workaround (not ideal, but functional) for this problem would be to replace .timeout(5.seconds) with .timeoutTo(5.seconds, IO.defer(IO.raiseError(new TimeoutException))). (credit to @armanbilge for realizing this would solve it)
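
Applied to t2 from the original example, the workaround looks roughly like this (a sketch of the suggestion above, not a new API):

import java.util.concurrent.TimeoutException
import cats.effect.IO
import scala.concurrent.duration._

// The IO.defer(...) is constructed at this call site, so it carries its own
// trace frame here rather than the shared one inside the library's timeout.
val t2 = IO.sleep(2.seconds)
  .timeoutTo(5.seconds, IO.defer(IO.raiseError(new TimeoutException)))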

djspiewak avatar Dec 24 '22 23:12 djspiewak

Thanks, @djspiewak! Not ideal just because it is a bit verbose?

geirolz avatar Dec 29 '22 18:12 geirolz

Just because it's a bit verbose :-) It's annoying because it's forcing you to effectively inline a combinator in order to manipulate the stack trace. This type of thing is somewhat to be expected, but that doesn't make it ideal.

djspiewak avatar Dec 29 '22 22:12 djspiewak

Tested this again with v3.5.0-RC5. Setting cats.effect.tracing.mode=full seems to give accurate stack traces, so if there was a bug there it was since quashed.

For cached tracing it is technically working as it should AFAICT ... and in a production codebase, it seems like there should typically be other context in the trace anyway ... right? 😀

Btw, here's a slightly minimized minimizer.

//> using lib "org.typelevel::cats-effect::3.5.0-RC5"

import cats.effect._
import cats.effect.unsafe.implicits._
import scala.concurrent.duration._

@main def main =
  val t1 = IO.sleep(1.millis).timeout(60.seconds)
  val t2 = IO.sleep(2.seconds).timeout(1.seconds) // line 9
  val t3 = t1 *> t2
  t3.unsafeRunSync()

Exception in thread "main" java.util.concurrent.TimeoutException: 1 second
        at timeout @ test$package$.main(test.scala:8)
        at timeout @ test$package$.main(test.scala:8)
        at unsafeRunSync @ test$package$.main(test.scala:11)
        at *> @ test$package$.main(test.scala:10)
        at timeout @ test$package$.main(test.scala:8)
        at timeout @ test$package$.main(test.scala:8)

The only way I can think of to fix this for cached tracing is maybe if we collected some other trace event at the timeout call site and inserted it into the stack, e.g. if the duration parameter was by-name (=> Duration). But that's a rather bespoke change that would have to happen in the typeclasses.

So I propose we either:

  1. close this as a won't-fix / can't-fix / by-design / annoying-workaround-available
  2. transform into a more focused issue: add trace events for timeout

Btw, another related idea I had: we currently have IO.trace: IO[Trace] which returns the current trace. But I wonder if we could have something like:

object IO {
  def traced[A](ioa: => IO[A]): IO[A] = ???
}

It would basically be like defer, but it would collect and insert (/replace?) a tracing event via the lambda, as an easier workaround for these kinds of issues. Actually, I don't yet fully understand why this doesn't work with defer already.
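
For reference, a hypothetical usage of the proposed combinator (IO.traced does not exist today; this only illustrates the intent of pinning the trace to the caller's line):

// Hypothetical API: IO.traced is only the proposal sketched above.
val step: IO[Unit] = IO.traced(IO.sleep(2.seconds).timeout(1.second))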

armanbilge avatar May 08 '23 03:05 armanbilge