Should exporters retry?
Recently we found a bug in one of the backends that shows up when we retry sending the same spans after a failure. The bug was subtle: the first request finished with DEADLINE_EXCEEDED (most likely generated by the load balancer or the client) even though the data had reached the backend. The library then retried sending the same data, so the data was duplicated in the backend.
The same problem can probably happen in other backends, so the main question is whether we should ever retry from the OC exporters. If we do, what should the policy be?
For more details, see: https://github.com/census-instrumentation/opencensus-java/issues/1201
/cc @adriancole @tsloughter @ramonza
We don't yet retry, but it was our intention to retry before dropping.
Is the issue that the backends are supposed to merge, instead of replace or drop, when multiple spans arrive with the same trace and span ids?
But I'm also fine with defining exports as not to be retried.
My opinion is to not retry. All data is best-effort anyway and retrying adds complexity and the possibility of cascading failure.
I think the right way to solve duplicated data is to have a de-dup mechanism in the backend (e.g. by treating traceId + spanId as the unique key for spans, and introducing a sequence number for metrics/logs). Retry has its value, and we can decide on the right place and time to use it.
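
To make the failure mode and the proposed fix concrete, here is a minimal Python sketch. Everything in it is hypothetical and for illustration only: `Backend`, `Span`, `DeadlineExceeded`, and `export_with_retry` are not OpenCensus or backend APIs. It assumes the first response is lost after the write succeeds, which is roughly the DEADLINE_EXCEEDED case described at the top of this issue.

```python
from dataclasses import dataclass, field


class DeadlineExceeded(Exception):
    """Stands in for an RPC that timed out from the caller's point of view."""


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    name: str


@dataclass
class Backend:
    """Toy backend; `dedup` toggles keying spans by (trace_id, span_id)."""
    dedup: bool
    lose_first_ack: bool = True  # assumption: first response is lost after the write
    rows: list = field(default_factory=list)
    seen: set = field(default_factory=set)

    def ingest(self, spans):
        for span in spans:
            key = (span.trace_id, span.span_id)
            if self.dedup and key in self.seen:
                continue  # already stored by an earlier attempt
            self.seen.add(key)
            self.rows.append(span)
        if self.lose_first_ack:
            self.lose_first_ack = False
            # The data was stored, but the caller only sees a timeout.
            raise DeadlineExceeded()


def export_with_retry(backend, spans, max_attempts=2):
    """Hypothetical exporter loop: resend the same batch after a timeout."""
    for _ in range(max_attempts):
        try:
            backend.ingest(spans)
            return True
        except DeadlineExceeded:
            continue
    return False


batch = [Span("t1", "s1", "GET /"), Span("t1", "s2", "db.query")]

naive = Backend(dedup=False)
export_with_retry(naive, batch)

idempotent = Backend(dedup=True)
export_with_retry(idempotent, batch)

print(len(naive.rows), len(idempotent.rows))  # prints "4 2"
```

With the same retry loop, the backend without de-dup ends up with every span stored twice, while the one keyed on (trace_id, span_id) stores each span once. This only covers spans; metrics and logs would need something like the sequence number mentioned above.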