
Fix "...Span <span name> is GC'ed without being ended." issue (caused by a BT timeout)

Open sming opened this issue 4 years ago • 1 comments

Hundreds of tracing spans are left un-ended after every query timeout

  • I am a prism goalie
  • Who wants to have a stable heroic
  • So that I can focus on features and not get woken up at night and have angry users

These un-ended spans represent a real runtime risk to heroic. If ~700-1000 of these are left hanging around after each timed-out query, it's conceivable that the JVM will:

  • run out of memory altogether
  • experience much longer GC pauses / sweep times (because all the hanging spans need reaping)
  • hugely inflate the size of heroic's logs, costing us $$$ and obscuring "genuine" problems

Proposed Solution

  • find the correct location to catch the BT timeout exception (not trivial)
  • catch it, end the span and rethrow it
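
A minimal sketch of the "end the span, then let the error propagate" idea, assuming the timeout surfaces through a future rather than as a synchronous exception. Everything here is hypothetical and not taken from heroic: `Span` is a stand-in interface with only an `end()` method (OpenCensus spans also expose `end()`), and `traced` and `RecordingSpan` are illustrative names. `CompletableFuture` is used only to keep the example self-contained; the real code path may use a different future type.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

public final class SpanEndingSketch {
    /** Hypothetical stand-in for a tracing span; only end() matters here. */
    interface Span {
        void end();
    }

    /** Test double that records whether end() was called. */
    static final class RecordingSpan implements Span {
        final AtomicBoolean ended = new AtomicBoolean(false);

        @Override
        public void end() {
            ended.set(true);
        }
    }

    /**
     * Runs an async operation and ends the span when it completes,
     * whether it succeeds or fails (for example, with a timeout).
     * The original error still reaches the caller via the returned future.
     */
    static <T> CompletableFuture<T> traced(Span span, Supplier<CompletableFuture<T>> op) {
        final CompletableFuture<T> future;
        try {
            future = op.get();
        } catch (RuntimeException e) {
            // The operation failed before a future was even created.
            span.end();
            throw e;
        }
        // whenComplete runs on success and on failure, and passes the outcome through.
        return future.whenComplete((result, error) -> span.end());
    }

    public static void main(String[] args) {
        RecordingSpan span = new RecordingSpan();
        CompletableFuture<String> pending = new CompletableFuture<>();
        CompletableFuture<String> result = traced(span, () -> pending);

        // Simulate the backend call timing out.
        pending.completeExceptionally(new TimeoutException("simulated BT timeout"));

        System.out.println("span ended: " + span.ended.get());
        System.out.println("error propagated: " + result.isCompletedExceptionally());
    }
}
```

Hooking the span's end to the future's completion, rather than to a `catch` block around the call site, covers the case where the timeout arrives on another thread after the calling method has already returned.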

Repro Steps

  • run heroic locally with GUC config and on branch feature/add-bigtable-timeout-settings-refactored
  • capture a lengthy query from Grafana using the Chrome DevTools network tab
  • alter the query to hit localhost and watch the logs; you'll see the messages listed below

List of methods concerned from logs

  1. ERROR io.opencensus.trace.Tracer - Span localMetricsManager.fetchSeries is GC'ed without being ended.
  2. ERROR io.opencensus.trace.Tracer - Span bigtable.fetchBatch is GC'ed without being ended.

sming, Feb 11 '21 22:02

FYI @adsail, moving to inbox as it's not something we'll need to tackle until more aggressive timeouts are deployed.

sming, Mar 24 '21 19:03