Discarding of HTTP result might hide failures.
https://github.com/mtth/tracing/blob/aa0cc7001795b5ccf9fbc87a9fa33282bdda5779/src/Monitor/Tracing/Zipkin.hs#L123
Do we need to check the HTTP status? I'm unsure about the impact - but in my case a unreachable address resulted in a 502 Bad Gateway status.
Maybe we should check for 2XX results (statusIsSuccessful)? But what should we do in other cases? Just output a warning?
I think the best is to expose these errors to users so they can be handled appropriately for the use-case - there isn't a one-size-fits-all handling strategy. Would https://github.com/mtth/tracing/pull/9 with an additional status check work? It exposes a separate publishSpans function and typed exception to provide this flexibility without bundling too much logic in the constructor.
@mtth Sorry for the late response, took some time until I found a chance to think it through - from my perspective #9 alone wouldn't solve everything perfectly. Maybe I've misunderstood something, but I think the following shortcoming is there:
- How to handle exceptions during background publishing? I tend to believe that most real world applications that use distributed tracing are some kind of long running back-end services that need continuous publishing and aren't able to collect and send all traces on a single point. For sure this routine/logic could be reproduced on the user side. https://github.com/mtth/tracing/blob/8d1d188be80d5b0d8ec79b251d1acfc34c6c9e76/src/Monitor/Tracing/Zipkin.hs#L136-L138
- With this I really think the API should be smoother here. Because to make things work without data-loss in cases of transient errors the user needs to implement the background publishing (that loops
publishand retries withpublishSpansin case of exceptions) - and also something similar forwith(retry logic in case of exceptions duringpublish).
So with that said, by adding a additional check, you could solve the issue! But I fear the price we pay is a really hard to understand API usage. From my perspective we should strive for a solution that works with the ZPK.with zpkSettings . ZPK.run function. So I wonder if it makes sense to establish some kind of optional exception handler in the settings record (rasing the exception could be the default handler)? Here the user could really decide what to do and the operation could be resumed (by internally re-queueing the spans).
And don't get me wrong, this is by no way a critic. I'm thankful for any improvement and value your development efforts ❤️ ! So please take this just as a side note/comment - I could be wrong.
Thanks for the thoughtful reply, @alaendle. Both approaches are complementary, we can both:
- expose primitives for clients to do custom error handling (that's #9);
- enhance the current default convenience behavior, as long as it's generic and simple enough.
For 2, we could for example add the HTTP status check and buffer spans on failure until the next retry. We'd add some state to Zipkin (maybe a TVar Seq, with a configurable limit to avoid OOMs). What do you think?