CBL-2791: Enable actor stack trace mechanism

Open borrrden opened this issue 2 years ago • 8 comments

This feature will keep track of two pieces of information:

  1. For any given execution, the path of enqueue and execution calls that led to the execution
  2. For any given actor, the linear history of enqueue and execution calls (regardless of source)

If an exception occurs, this information is dumped to the logs.
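
To make the second item concrete, here is a minimal sketch of a per-actor history that records enqueue and execution events in order and produces a dump when an exception escapes. All names (`Manifest`, `Event`, `runRecorded`) and the entry limit are hypothetical and are not taken from the PR; the sketch returns the dump as a string, where a real implementation would presumably write it to the log.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical names throughout; not the types used in the actual PR.
enum class EventKind { Enqueue, Execute };

struct Event {
    EventKind   kind;
    std::string name;   // label of the enqueued or executed call
};

// Bounded, ordered record of enqueue/execution events for one actor.
class Manifest {
public:
    explicit Manifest(std::size_t limit = 100) : _limit(limit) {}

    void record(EventKind kind, std::string name) {
        _events.push_back({kind, std::move(name)});
        if (_events.size() > _limit)
            _events.pop_front();   // drop the oldest entry once over the limit
    }

    // Renders the history oldest-first, one event per line.
    std::string dump() const {
        std::ostringstream out;
        for (const Event& e : _events)
            out << (e.kind == EventKind::Enqueue ? "enqueue " : "execute ")
                << e.name << '\n';
        return out.str();
    }

    std::size_t size() const { return _events.size(); }

private:
    std::size_t       _limit;
    std::deque<Event> _events;
};

// Records the execution, runs `fn`, and returns the history dump if `fn`
// throws a std::exception. Returns an empty string on success.
template <class Fn>
std::string runRecorded(Manifest& history, const std::string& name, Fn&& fn) {
    history.record(EventKind::Execute, name);
    try {
        fn();
    } catch (const std::exception& x) {
        return "exception: " + std::string(x.what()) + "\n" + history.dump();
    }
    return "";
}
```

Because the history is capped, memory use stays bounded no matter how long the actor lives; the trade-off is that only the most recent events survive to the dump.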

borrrden avatar Feb 16 '22 23:02 borrrden

Mark as draft until resolution of iOS simulator issue

borrrden avatar Feb 17 '22 00:02 borrrden

@snej The GCD implementation of this was a bit awkward. Let me know if you have any ideas about how to improve it. It's complicated by the fact that the manifest used as the "queue manifest" needs to live long enough to be used by however many recursive calls happen in a given execution. For the threaded mailbox this just meant using a thread-local static shared_ptr and copying it to each context that uses it. However, thread_local is not allowed in the iOS simulator and doesn't really fit well with the queue-based logic, so I tried to make use of the set_specific API, which needs a pointer. So when retrieving, a pointer to the shared_ptr is retrieved and then dereferenced (i.e. copied).
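
For readers following along, the threaded-mailbox approach described above might look roughly like the sketch below. It is only an illustration under assumptions: `CallPath`, `Task`, `makeTask`, `runTask`, and `tCurrentPath` are hypothetical names, not the identifiers in the PR, and the GCD variant (which stores a pointer to the shared_ptr as queue-specific data instead of using `thread_local`) is not shown.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the "queue manifest": the chain of calls so far.
using CallPath = std::vector<std::string>;

// Manifest of the execution currently running on this thread (null when idle).
static thread_local std::shared_ptr<CallPath> tCurrentPath;

struct Task {
    std::string               name;
    std::shared_ptr<CallPath> path;   // this copy keeps the manifest alive
    std::function<void()>     body;
};

// Installs a path as "current" and restores the previous one on scope exit,
// including when the task body throws.
struct PathScope {
    std::shared_ptr<CallPath> saved;
    explicit PathScope(std::shared_ptr<CallPath> p) : saved(tCurrentPath) {
        tCurrentPath = std::move(p);
    }
    ~PathScope() { tCurrentPath = std::move(saved); }
    PathScope(const PathScope&) = delete;
    PathScope& operator=(const PathScope&) = delete;
};

// Called at enqueue time: reuses the path of the running execution, if any,
// so that nested enqueues share one manifest.
Task makeTask(const std::string& name, std::function<void()> body) {
    std::shared_ptr<CallPath> path =
        tCurrentPath ? tCurrentPath : std::make_shared<CallPath>();
    path->push_back("enqueue " + name);
    return Task{name, std::move(path), std::move(body)};
}

// Called when the mailbox thread picks the task up.
void runTask(const Task& task) {
    PathScope scope(task.path);
    task.path->push_back("execute " + task.name);
    task.body();
}
```

Each task holds its own `shared_ptr` copy, so the manifest outlives the execution that created it for as long as any nested task still refers to it.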

borrrden avatar Feb 17 '22 02:02 borrrden

Do we need this for Apple platforms? It sounds like the same info that Xcode's debugger already shows. (At least item 1.)

snej avatar Feb 19 '22 00:02 snej

That's great if you are running in a debugger, but I'm thinking about this information being put into our logs so that we can have it even from the field.

borrrden avatar Feb 19 '22 00:02 borrrden

That is a lot of overhead to add to production builds! I think I'd need to be convinced that this is necessary. I can see that it would be useful on some occasions, but it would be slowing everything down and adding to memory bloat.

Is this something that can be disabled except by a runtime flag?

snej avatar Feb 19 '22 00:02 snej

I'm trying to balance things out here. If it is enabled or disabled at runtime then I guarantee we get into a case where it's off and we end up back where we started -> with a bunch of intertwined logs that make it difficult to navigate through the flow of an issue of "replicator getting stuck" without context of the calls that led to that point. I disagree that it adds an excessive amount of memory pressure since it's going to prune out entries as it receives new ones (I'm certainly open to decreasing the number of entries that are saved though). I'm going to start proposing a lot of changes like this because our logs in general often leave us puzzled as to what is going on. I want a way to rectify this situation by collecting some data about the state of the program that can be accessed on demand or at exception time in this case. Whether or not having thread / queue local stuff adds too much of a performance penalty is something I could debate about.

In short, we need something here to help us navigate an actor-based world in which the most common form of bug is a race condition or hang. Simply logging things is not enough. What I am after is an answer to the question "in what order did things happen to get here?"

EDIT: I also thought of adding this information to logs instead, but I figured that collecting it in memory would be better overall for performance than logging it all in real time.

borrrden avatar Feb 19 '22 00:02 borrrden

We really need performance testing in CI so we can see whether a new feature like this affects performance...

snej avatar Feb 19 '22 01:02 snej

That's something that could be arranged I think. ~~The only problem is that there is no clear pass/fail metric to use in CI that I can think of.~~

borrrden avatar Feb 19 '22 01:02 borrrden