Fix "...Span <span name> is GC'ed without being ended." issue (caused by a BT timeout)

Open sming opened this issue 4 years ago • 1 comments

Hundreds of tracing spans are left un-ended after every query timeout

I am a prism goalie
Who wants to have a stable heroic
So that I can focus on features and not get woken up at night and have angry users

These un-ended spans represent a real runtime risk to heroic. If ~700-1000 of them are left hanging around after each timed-out query, it's conceivable that the JVM will:

potentially run out of memory altogether
experience much longer GC pauses / sweep times (because all the hanging spans need reaping)
hugely inflate the size of heroic's logs, costing us $$$ and obscuring "genuine" problems

Proposed Solution

find the correct location to catch the BT timeout exception (not trivial)
catch it, end the span, and rethrow it
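
A minimal sketch of the "end the span, then let the failure propagate" idea, assuming the timeout surfaces as an exceptionally completed future. Everything here is hypothetical: `SketchSpan` is a stand-in for `io.opencensus.trace.Span`, `endSpanOnCompletion` is an illustrative helper, and `CompletableFuture` stands in for whatever future type heroic actually uses at the point where the BT timeout is raised.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a tracing span. In heroic this role is played by
// io.opencensus.trace.Span, which must be ended before it is garbage collected.
final class SketchSpan {
    final String name;
    volatile boolean ended = false;

    SketchSpan(String name) {
        this.name = name;
    }

    void end() {
        ended = true;
    }
}

final class SpanEndingSketch {
    // Ends the span when the future completes, whether normally or exceptionally.
    // whenComplete does not swallow the failure, so a timeout still reaches the
    // caller: this plays the role of "catch it, end the span, and rethrow it".
    static <T> CompletableFuture<T> endSpanOnCompletion(
            SketchSpan span, CompletableFuture<T> future) {
        return future.whenComplete((result, error) -> span.end());
    }

    public static void main(String[] args) {
        SketchSpan span = new SketchSpan("bigtable.fetchBatch");
        CompletableFuture<String> fetch = new CompletableFuture<>();
        CompletableFuture<String> traced = endSpanOnCompletion(span, fetch);

        // Simulate the Bigtable timeout (assumed exception type).
        fetch.completeExceptionally(new TimeoutException("simulated BT timeout"));

        System.out.println("span ended: " + span.ended);
        System.out.println("failure propagated: " + traced.isCompletedExceptionally());
    }
}
```

Hooking the span's end to the completion of the future, rather than to a `try`/`catch` around the call site, covers the case where the timeout fires on another thread after the calling method has already returned. Finding the right future to attach to is still the non-trivial part.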

Repro Steps

run heroic locally with the GUC config, on branch feature/add-bigtable-timeout-settings-refactored
capture a lengthy query from grafana using the network tab of the Chrome dev tools
alter the query to hit localhost and watch the logs; the "Span <span name> is GC'ed without being ended." message will appear

List of methods concerned from logs

ERROR io.opencensus.trace.Tracer - Span localMetricsManager.fetchSeries is GC'ed without being ended.
ERROR io.opencensus.trace.Tracer - Span bigtable.fetchBatch is GC'ed without being ended.

Feb 11 '21 22:02 sming

FYI @adsail, moving this to the inbox as it's not something we'll need to tackle until more aggressive timeouts are deployed

Mar 24 '21 19:03 sming