
Should exporters retry?

Open bogdandrutu opened this issue 7 years ago • 4 comments

Recently we found a bug in one of the backends that shows up when we retry sending the same spans after a failure. The bug was subtle: the first request finished with DEADLINE_EXCEEDED (most likely generated by the load balancer or the client) even though the data had reached the backend, so when the library retried, the same data was stored twice.

The same problem can probably happen in other backends, so the main question is whether we should ever retry from the OC exporters. If we do, what should the policy be?

See for more details: https://github.com/census-instrumentation/opencensus-java/issues/1201

bogdandrutu, May 23 '18 17:05

/cc @adriancole @tsloughter @ramonza

bogdandrutu, May 23 '18 17:05

We don't yet retry, but it was our intention to retry before dropping.

Is the issue that the backends are supposed to merge, instead of replacing or dropping, when multiple spans arrive with the same trace and span IDs?

But I'm also fine with defining exports as not to be retried.

tsloughter, May 23 '18 17:05

My opinion is to not retry. All data is best-effort anyway and retrying adds complexity and the possibility of cascading failure.

semistrict, May 23 '18 18:05

> Recently we found a bug in one of the backends that shows up when we retry sending the same spans after a failure. The bug was subtle: the first request finished with DEADLINE_EXCEEDED (most likely generated by the load balancer or the client) even though the data had reached the backend, so when the library retried, the same data was stored twice.

I think the right way to solve duplicated data is to have a de-dup mechanism in the backend (e.g. by treating traceId + spanId as the unique key for spans and introducing a sequence number for metrics/logs). Retry has its value, and we can decide where and when it is appropriate to use it.

reyang, Apr 10 '19 23:04
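
For readers following the thread, here is a minimal sketch of the scenario from the original report together with the de-dup idea from the last comment. It is an illustration only, not OpenCensus code or any real backend's API: `Span`, `AppendOnlyBackend`, `DedupBackend`, `send_with_lost_ack`, and `export_with_one_retry` are all hypothetical names, and the "network" is simulated by a function that applies the write and then raises, the way a lost response would look to the client.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    name: str


@dataclass
class AppendOnlyBackend:
    """Hypothetical backend that stores every span it receives."""
    rows: list = field(default_factory=list)

    def write(self, spans):
        self.rows.extend(spans)

    def count(self):
        return len(self.rows)


@dataclass
class DedupBackend:
    """Hypothetical backend keyed on (trace_id, span_id)."""
    rows: dict = field(default_factory=dict)

    def write(self, spans):
        for span in spans:
            # A repeated key overwrites the earlier copy instead of adding a row.
            self.rows[(span.trace_id, span.span_id)] = span

    def count(self):
        return len(self.rows)


class DeadlineExceeded(Exception):
    """Stand-in for the DEADLINE_EXCEEDED status seen by the client."""


def send_with_lost_ack(backend, spans):
    """Deliver the spans, then fail as if the response never arrived."""
    backend.write(spans)
    raise DeadlineExceeded("response lost after the write was applied")


def export_with_one_retry(backend, spans):
    """The first attempt loses its ack, so the exporter retries once."""
    try:
        send_with_lost_ack(backend, spans)
    except DeadlineExceeded:
        backend.write(spans)  # the retry resends the same batch


batch = [Span("t1", "s1", "GET /"), Span("t1", "s2", "db.query")]

append_only = AppendOnlyBackend()
export_with_one_retry(append_only, batch)

dedup = DedupBackend()
export_with_one_retry(dedup, batch)

print(append_only.count(), dedup.count())  # prints "4 2"
```

The keyed backend here simply overwrites on a repeated key; a backend that merges spans with the same IDs, as raised earlier in the thread, would need its own merge rule, and metrics or logs would need a different key such as the sequence number mentioned above.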