dd-trace-rb

Allow custom id generation

Open rvignesh89 opened this issue 2 years ago • 5 comments

I would like to preface this request by saying that this might not be what DD Tracing was built for, but I find the tool very useful, so I'm exploring a possible solution. This is more of a question than a request 😄

I would like to be able to trace events happening in a CI/CD pipeline, along with their durations. The parent span would be a PR, and all the events happening in a PR (unit tests, integration tests, deployment to staging and production) would be child spans. So the parent span would not complete until the PR is merged. I don't want to pass trace digests around to different processes, as that would be an expensive solution; instead, I want each process to know its parent implicitly.

If I were to do this with DD, I would need the following,

  1. Ability to generate a unique trace_id & span_id for each PR, given the repository name and PR id. The ids need to be known beforehand because I can't create the parent span until the PR is merged; otherwise I wouldn't know the parent span id for the child spans (e.g. unit tests in a CI runner). I was able to hack something together by taking only the last 63 chars of an MD5 digest, but I noticed there are collisions when I test the id generation.
  2. Ability to create spans with these ids. I noticed OpenTelemetry has a specification for doing this, so I'm wondering if there are any plans to allow that in DD. I'm actually already able to do this with some monkey patching of SpanOperation & TraceDigest (thanks to the new 1.0.0 API 🙇), but I'd like to avoid monkey patching for ids if possible.
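
As a rough illustration of point 1, an ID can be derived deterministically from the repository name and PR number by hashing them and masking the result to 63 bits. This is only a sketch: `id_63`, `pr_ids`, the key format, and the choice of SHA-256 are assumptions made for the example, not part of dd-trace-rb, and the library does not expose a supported way to pass such IDs in.

```ruby
require 'digest'

# Mask keeping the low 63 bits, the ID size discussed in this issue.
MASK_63 = (1 << 63) - 1

# Hypothetical helper: derive a stable 63-bit integer from a string key.
def id_63(key)
  # Read the first 8 bytes of the digest as a big-endian unsigned 64-bit
  # integer, then drop the top bit so the value fits in 63 bits.
  Digest::SHA256.digest(key).unpack1('Q>') & MASK_63
end

# Hypothetical helper: IDs for the parent "PR span".
def pr_ids(repo, pr_number)
  {
    trace_id: id_63("trace:#{repo}:#{pr_number}"),
    span_id: id_63("span:#{repo}:#{pr_number}")
  }
end

puts pr_ids('example-org/example-repo', 42)
```

Because both IDs depend only on the repository name and PR number, any process can recompute them without receiving a trace digest.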

So my questions are,

  1. Would you be open to extending the trace API to allow custom IDs which fit the DD specification of a 63-bit unsigned int?
  2. Does DD have plans to extend the size of ids to 128 bits?
  3. If there is a collision, what does DD do with the new spans? I tried this and the existing span did not change. So are spans immutable?
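
For context on question 3, the chance of an accidental collision among 63-bit IDs can be estimated with the birthday approximation. The sketch below assumes the IDs behave like uniformly random values, which is reasonable for a good hash but not for a derivation that discards entropy. `collision_probability` is a name made up for this example.

```ruby
# Birthday approximation: probability that at least two of n uniformly
# random IDs of the given bit width collide,
# p ~= 1 - exp(-n * (n - 1) / 2**(bits + 1)).
def collision_probability(n, bits: 63)
  exponent = n * (n - 1) / (2.0**(bits + 1))
  1.0 - Math.exp(-exponent)
end

[1_000, 1_000_000, 1_000_000_000].each do |n|
  puts format('%d IDs: p ~ %.3e', n, collision_probability(n))
end
```

If collisions show up after only a handful of test IDs, that likely points to lost entropy in the derivation rather than to bad luck.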

rvignesh89 avatar Mar 30 '22 05:03 rvignesh89

So there's a lot to unpack here. Short answer is I would avoid using the tracer to measure human processes (like merging PRs), and that our CI/CD product might have some of the things you're looking for. Wanting to measure PR merges and deploys is interesting; maybe it fits with their goals. (cc @juan-fernandez )

Longer answer:

  • Tracing is designed for measuring technical, autonomous operations, and generally operates in very small time scales. It's ill-suited for very long running tasks (that take hours) as that time scale makes flamegraphs unusable, and the backend won't retain traces that don't complete within a certain window (there's no way to tell whether the trace is running or broken.)
  • Custom ID generation is not permitted; I think this has more to do with how IDs are stored/indexed in the backend. We also use IDs to do some sampling math, so custom IDs might interfere with that.
    • Speaking of sampling, tracing retains a subset of traces it generates and upscales metrics accordingly, in order to keep resource cost down for both users and Datadog. Assuming you want to keep all of these traces, these kinds of traces would need special flags (which we have) to avoid them being sampled out.
  • Not sure about extending IDs to 128 bits. Might happen at some point, but it's not planned for the near future.
  • Once submitted to the backend, spans effectively become immutable. They are considered completed events, thus cannot be decorated or modified. It's possible to mutate them in the Ruby tracer while the span is active, or before they're submitted via our processing pipeline, but once they're transmitted, that's it.
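
To illustrate why custom IDs could interfere with sampling math, here is a generic sketch of deterministic, ID-based sampling: the trace ID is scrambled with a multiplicative hash and compared against a threshold. The constant, the modulus, and the method name are assumptions chosen for illustration and should not be read as the tracer's actual implementation.

```ruby
# Illustrative constants (assumptions for this sketch).
HASH_FACTOR = 1_111_111_111_111_111_111
ID_SPACE = 1 << 64

# Keep a trace when its hashed ID falls below rate * ID_SPACE.
# The decision depends only on the ID, so it is the same in every process.
def keep_trace?(trace_id, rate)
  ((trace_id * HASH_FACTOR) % ID_SPACE) < rate * ID_SPACE
end

# With uniformly random IDs, roughly `rate` of the traces are kept.
rng = Random.new(1234)
ids = Array.new(10_000) { rng.rand(1 << 63) }
kept = ids.count { |id| keep_trace?(id, 0.25) }
puts "kept #{kept} of #{ids.size} random IDs"
```

Under a scheme like this, a hand-picked ID fixes the sampling decision for that trace in advance, so the kept share only matches the configured rate if the IDs are well distributed.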

Hopefully that context helps.

delner avatar Apr 06 '22 16:04 delner

Custom ID generation is not permitted

So I take it there are no plans to allow that in this library. As for how it's stored/indexed, would it be an issue if I use the OpenTelemetry SDK to send traces to Datadog through the Datadog exporter? I ask because the OpenTelemetry specification allows custom ID generators.

As for the flame graph visualisation, I did try that out and it doesn't look too bad when the duration of traces is in weeks/months. Plus, I'm guessing that's the same visualisation used in the CI/CD pipeline tracing feature of Datadog. While the flame graph is useful, I don't think we'd use it as much as the analytics of individual spans. I hope to use those analytics to answer questions like the following, while using the flame graph just to visualise the steps in the pipeline:

  1. Is the build time increasing over a time period?
  2. How long do different apps take to build images in a pipeline?
  3. How long do engineers test apps in staging?

The CI/CD product does indeed align more with what I'm trying to do here, but it looks more like a plugin for CI/CD tools like Jenkins/GitHub, which, while useful individually, doesn't help me connect the individual events. But I'd be happy to hear any suggestions from @juan-fernandez

rvignesh89 avatar Apr 10 '22 02:04 rvignesh89

would it be an issue if I use the OpenTelemetry SDK to send traces to Datadog through the Datadog exporter?

Datadog supports OTel through the use of OTLP, where you can send OTel traces directly to the agent. In that case, you would not use this library.
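
If the OpenTelemetry route is taken, the IDs have OpenTelemetry's sizes: a 16-byte trace ID and an 8-byte span ID, usually written as 32 and 16 hex characters. A hedged sketch of deriving them deterministically and formatting a W3C `traceparent` value follows. The helper names and key format are invented for this example, and how the Datadog backend treats such IDs is not confirmed anywhere in this thread.

```ruby
require 'digest'

# Hypothetical helper: first `bytes` bytes of a SHA-256 digest, as lowercase hex.
def hex_id(key, bytes)
  Digest::SHA256.hexdigest(key)[0, bytes * 2]
end

# Hypothetical helper: W3C Trace Context `traceparent` value for a PR.
def pr_traceparent(repo, pr_number)
  trace_id = hex_id("trace:#{repo}:#{pr_number}", 16) # 128-bit trace ID
  span_id = hex_id("span:#{repo}:#{pr_number}", 8)    # 64-bit parent span ID
  # Format: version "00", trace ID, parent span ID, flags "01" (sampled).
  "00-#{trace_id}-#{span_id}-01"
end

puts pr_traceparent('example-org/example-repo', 42)
```

In principle, a CI job could recompute this string and hand it to a W3C Trace Context propagator to attach its spans to the PR span, without any digest being passed between processes.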

I hope to use those analytics to answer questions like the following, while using the flame graph just to visualise the steps in the pipeline:

Is the build time increasing over a time period? How long do different apps take to build images in a pipeline? How long do engineers test apps in staging?

I think these are great questions, and demonstrate a particular need. I just don't think tracing is built for this purpose, and likely will be a sub-optimal platform on which to build this. (At least that's my current impression.) I would like to see Datadog have some kind of product/service/feature that addresses this kind of dev process tooling more specifically; I'll share this with our product team and see what they think.

delner avatar Apr 19 '22 05:04 delner

hi @rvignesh89!

The workflow you describe is something we don't currently support in CI Visibility. The closest would be our pipeline instrumentation, which generates spans that would be children of the "PR span" you mention.

We've discussed internally defining an API that lets users describe their own event-based pipelines, in which they send us events, their metadata, and parent:child relationships. This should be able to support what you describe. I want to highlight that we haven't decided whether to proceed with this or not.

I want to ask you a couple of questions to understand whether what you are looking for is covered by our current product, and how we could support it if it isn't:

  1. Is the build time increasing over a time period?

Is this build time tied to a specific job in a pipeline in your CI? If it is, we are able to provide this already with our pipeline instrumentation, but it is limited to our retention period, which might not be enough depending on your use case.

  2. How long do different apps take to build images in a pipeline?

Same as above.

  3. How long do engineers test apps in staging?

This one is unclear to me. Would this metric correlate with the time a PR is open? Or are you looking for actual time spent by engineers going through apps?

Thanks for sharing your ideas and thoughts! This helps us build a better product 😄

juan-fernandez avatar Apr 21 '22 15:04 juan-fernandez

@delner I understand your reservations about using Tracing for this. I also noticed caveats, such as traces being ignored if they occur too far in the past, so I can't retroactively send spans. But given the ease with which I can achieve this now, I might be okay with the tradeoffs. I plan to raise this internally in my org to see the traction for it. Perhaps we can reach out through our technical contacts if this takes off.

Hi @juan-fernandez! The event-based pipelines sound like exactly what I need. One thing to note, however, is that these events don't share any context, which is why in my original proposal I custom-generate a trace ID to link spans together.

  1. The build time is indeed part of a pipeline. I'm guessing when you say pipeline instrumentation you are referring to this. If so, I haven't tried installing it to see how it works. It might be enough for that case, but it becomes harder to link events that are not part of the pipeline. We also have a few different tools used in the pipeline (getting code tested and to prod), which makes this harder to achieve.
  2. To track how long users test apps, I imagine it would be measured through custom instrumentation, again through events. We have an internal mechanism to track this, and if there were an API I could call with event,duration,parent, I would send an event like test-staging,3hours,PR# and we would then see the span attached to the parent span, which would be the PR span. This can also be achieved through metrics, but attaching it to the whole pipeline and viewing the bigger picture would be eye-opening IMO.
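
The event,duration,parent idea can be sketched as a small data model. No such Datadog API exists according to this thread, so everything below (the `PipelineEvent` struct, its fields, `parse_event`, and `children_by_parent`) is hypothetical and only shows how events without shared context could be linked by a parent key.

```ruby
# Hypothetical event record: name, duration in seconds, and a parent key
# (for example a PR reference).
PipelineEvent = Struct.new(:name, :duration_s, :parent, keyword_init: true)

# Parse a line such as "test-staging,10800,example-repo#42".
def parse_event(line)
  name, duration, parent = line.strip.split(',', 3)
  PipelineEvent.new(name: name, duration_s: Integer(duration), parent: parent)
end

# Group events under their parent key, so each PR lists its child events.
def children_by_parent(events)
  events.group_by(&:parent)
end

lines = [
  'unit-tests,300,example-repo#42',
  'test-staging,10800,example-repo#42',
  'unit-tests,280,example-repo#43'
]
tree = children_by_parent(lines.map { |l| parse_event(l) })
tree.each do |parent, events|
  total = events.sum(&:duration_s)
  puts "#{parent}: #{events.map(&:name).join(', ')} (#{total}s total)"
end
```

The parent key plays the role that a shared trace ID plays in the original proposal: each sender only needs to know which PR it belongs to.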

rvignesh89 avatar Apr 27 '22 14:04 rvignesh89