kong icon indicating copy to clipboard operation
kong copied to clipboard

fix(pdk) - respect the trace flags from the parent span (if present)

Open andyroyle opened this issue 9 months ago • 2 comments

Summary

Kong respects the should_sample decision from a parent span if one was provided. Previously it was not doing this resulting in incomplete traces.

When using kong as an internal api gateway we are seeing incomplete traces. The logs show that the following appears to be happening:

(this is based on a sampling rate of 0.01)

  • Service A makes a request to Service B with no traceparent header specified
  • Kong generates a traceparent header and specifies trace flags of "01" (i.e. the trace should be sampled)
  • Service B handles the request from Service A and ingests the traceparent header
  • Service B then makes a request to Service C, setting the traceparent header appropriately (i.e. using flags "01")
  • Kong then resets the trace flags to "00" (i.e. the trace should not be sampled)
  • Service C handles the request from Service B and, respecting the flags from the traceparent header, does not sample the trace

This results in Kong and Services A and B sampling the trace, but Service C does not, and so we get an incomplete trace.

I identified the get_sampling_decision method as the most likely place where the issue occurs, and since parent_should_sample is a tri-state boolean (i.e. it can be true, false, or nil) it makes most sense (to me) to respect the value of parent_should_sample if it exists.

I tested this locally and it behaves as I would expect; when a traceparent header is included in the request, then the flags are respected, however if not, probabilistic sampling is applied.

However rereading the code, I can't work out why that should be the case, since the probablistic sampler is deterministic based on trace id (i.e. the same trace-id will always return the same outcome). Even stepping through the code, I couldn't work out exactly what was going on. It seems likely to me that it's something to do with how the root span is constructed, but I couldn't tell for sure.

Checklist

  • [ ] The Pull Request has tests
  • [x] A changelog file has been created under changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
  • [ ] There is a user-facing docs PR against https://github.com/Kong/docs.konghq.com - PUT DOCS PR HERE

andyroyle avatar May 10 '24 10:05 andyroyle

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar May 10 '24 10:05 CLAassistant

Hi @andyroyle, thank you for your contribution!

Could you please test the branch from this PR and let me know if it soves your problem? The explanation of the fix (which should answer your questions) is in the PR description.

Regarding your changes:

I identified the get_sampling_decision method as the most likely place where the issue occurs, and since parent_should_sample is a tri-state boolean (i.e. it can be true, false, or nil) it makes most sense (to me) to respect the value of parent_should_sample if it exists.

This is not wrong, however, it is also not the way Kong's sampler is currently designed and I'm afraid your logic might turn out to be a breaking change for some. At the moment, as you noticed, the parent's sampling decision is only inherited when false, else the probabilistic logic is applied.

This (if everything worked correctly) should allow configuring the sampler to do either of the following:

  1. Always inherit the parent's decision, when the sampling_rate matches that of the parent
  2. Apply additional sampling on top of the incoming trace, with a different value of sampling_rate

Since some might be relying on (2.), if we wanted to introduce a change and make Kong support a "parent based sampler", it should be configurable. However, I believe that with the fix I linked, this should not be needed any longer, but please do let me know in case I missed anything or you have a different opinion.

I am converting this to draft for the time being. Once again, thank you for your investigation and for bringing this up!

samugi avatar Jun 21 '24 20:06 samugi

Yep, I think you identified the correct way to solve this issue. I think the key piece of context I was missing was that the sampling_rate could be empty (because it's actually the plugin's sampling rate, which can be different from the global sampling rate).

Amazing work, thanks for highlighting this, otherwise I would probably have continued to use my "fixed" version forever, believing it to be correct 😬

andyroyle avatar Jul 05 '24 15:07 andyroyle

Yep, I think you identified the correct way to solve this issue. I think the key piece of context I was missing was that the sampling_rate could be empty (because it's actually the plugin's sampling rate, which can be different from the global sampling rate).

Amazing work, thanks for highlighting this, otherwise I would probably have continued to use my "fixed" version forever, believing it to be correct 😬

Thanks for getting back to me @andyroyle and for confirming the fix (now merged) works! I'm really glad that helped. I'm going to close this PR, your contribution was very much appreciated!

samugi avatar Jul 05 '24 15:07 samugi