
Datadog sinks batch together events with different API keys

Open byronwolfman opened this issue 1 year ago • 7 comments

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We encountered an interesting batching issue when using the Datadog Agent source's store_api_key parameter. When enabled, Vector stores the API key that the agent submitted and uses it to send those events back to Datadog if they end up routed to a Datadog sink.

When a Vector cluster is deployed in an aggregator topology, it is possible for different Datadog agents to submit payloads with different Datadog API keys. This isn't a real problem if all of the agents are expected to submit data to the same Datadog organization, and if all of the API keys are valid. However, it seems to be possible that the sink will batch together events with different API keys, which can be a problem if one of them is revoked. We encountered a situation where a revoked API key ended up blocking all submission, I suspect because the following happened (values are arbitrary):

  1. 1000 events arrived with a usable API key
  2. 1 event arrived with a bad/revoked API key
  3. All 1001 events were batched together at the Datadog Log sink
  4. Vector chose the bad/revoked API key to submit all 1001 log events
  5. Datadog's API returned HTTP 403

In effect, the bad API key "polluted" the sink such that no events could be submitted, because even those that arrived at the Vector source with a good API key ended up being batched with at least one event with a bad API key.

In the end we mitigated this by disabling store_api_key and asserting a working key, but this isn't ideal in our particular scenario (which is not to say it isn't a perfectly good approach in others).

It's possible that this behaviour is correct and expected, in which case I think the only thing needed is documentation describing the potential pitfalls of using store_api_key and what it can mean for batching. If this is unintended behaviour, though, then the Datadog sinks should perhaps batch events by API key so that there is no cross-contamination from revoked API keys, or data sent to the wrong Datadog tenant in the case of multiple working API keys.

Configuration

data_dir: /tmp
schema:
  log_namespace: true

sources:
  datadog:
    type: datadog_agent
    address: "0.0.0.0:8080"
    disable_traces: true
    multiple_outputs: true
    parse_ddtags: true
    store_api_key: true # use Datadog API key from agent submission

sinks:
  datadog_logs:
    type: datadog_logs
    inputs: ["datadog.logs"]
    compression: gzip
    default_api_key: "${DATADOG_FALLBACK_API_KEY:-none}"
    acknowledgements:
      enabled: true
    healthcheck:
      enabled: false

  datadog_metrics:
    type: datadog_metrics
    inputs: ["datadog.metrics"]
    default_api_key: "${DATADOG_FALLBACK_API_KEY:-none}"
    healthcheck:
      enabled: false

Version

vector 0.40.2 (x86_64-apple-darwin 170687d 2024-09-09 14:07:09.232169228)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

byronwolfman · Oct 2, 2024

Hi @byronwolfman ,

Thanks for this detailed write-up!

The datadog_logs sink actually partitions batches by API key (that is, all events in a given batch will have the same API key). You can see this partition implementation here: https://github.com/vectordotdev/vector/blob/614a0147af88b38c957cf8b274aa75892192dabd/src/sinks/datadog/logs/sink.rs#L20-L29
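
To make the partitioning concrete, here is a rough, self-contained sketch of the idea; the types and names are hypothetical, not Vector's actual code:

use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for an event that may carry the API key the
// Datadog agent submitted it with (what `store_api_key: true` stores).
struct Event {
    message: String,
    api_key: Option<Arc<str>>,
}

// Group events by API key so that each batch holds exactly one key.
// `None` means "no per-event key stored"; those events fall back to the
// sink's `default_api_key` when the request is built.
fn partition_by_api_key(events: Vec<Event>) -> HashMap<Option<Arc<str>>, Vec<Event>> {
    let mut batches: HashMap<Option<Arc<str>>, Vec<Event>> = HashMap::new();
    for event in events {
        batches.entry(event.api_key.clone()).or_default().push(event);
    }
    batches
}

fn main() {
    let events = vec![
        Event { message: "from agent A".into(), api_key: Some(Arc::from("good-key")) },
        Event { message: "from agent B".into(), api_key: Some(Arc::from("revoked-key")) },
        Event { message: "no key stored".into(), api_key: None },
    ];
    for (key, batch) in partition_by_api_key(events) {
        let messages: Vec<&str> = batch.iter().map(|e| e.message.as_str()).collect();
        println!("{:?} -> {:?}", key, messages);
    }
}

With that grouping, a revoked key only ever ends up in its own requests rather than riding along with events carrying healthy keys.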

Now, the behavior you observed is that an event with an invalid API key blocked all requests. I think what might have happened is that the requests carrying the invalid API key caused the adaptive request concurrency (ARC) mechanism to keep backing off the number of concurrent requests until only one concurrent request was allowed: the failing one, which blocked the pipeline while it was continuously retried. Did you happen to be capturing Vector's internal metrics when you observed this behavior? I'd be interested in seeing the value of adaptive_concurrency_limit. If I'm right, a workaround would be to set a static concurrency instead via request.concurrency = X (e.g. 10).
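
To sketch the failure mode I have in mind (this is not Vector's ARC implementation, just a minimal AIMD-style illustration):

// Not Vector's actual ARC code: just a minimal additive-increase /
// multiplicative-decrease (AIMD) limiter to illustrate the theory. The
// real implementation also weighs response timings, but the failure
// mode is the same: the limit is shared across the whole sink.
struct AimdLimit {
    current: usize,
    max: usize,
}

impl AimdLimit {
    fn new(max: usize) -> Self {
        Self { current: max, max }
    }

    // Healthy responses nudge the shared limit back up, capped at `max`.
    fn on_success(&mut self) {
        self.current = (self.current + 1).min(self.max);
    }

    // Failed responses (e.g. HTTP 403) cut the shared limit down,
    // bottoming out at a single in-flight request.
    fn on_error(&mut self) {
        self.current = (self.current / 2).max(1);
    }
}

fn main() {
    let mut limit = AimdLimit::new(64);
    limit.on_success(); // healthy traffic keeps the limit at its cap

    // The batch keyed to the revoked API key fails and is retried over
    // and over; every failure shrinks the limit for *all* batches.
    for _ in 0..7 {
        limit.on_error();
    }
    assert_eq!(limit.current, 1);
    println!("sink-wide concurrency after repeated 403s: {}", limit.current);
}

Under that model, a fixed request.concurrency sidesteps the collapse of the shared limit, which is why the static setting above could work around the problem.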

cc/ @bruceg does this sound plausible? I think this could be a deficiency of the ARC implementation, if so.

jszwedko · Oct 2, 2024

Hi @jszwedko,

Thanks for following up so quickly! That's great that the sink takes into account the API keys, even if it blows my theory out of the water. ARC sounds like a possible factor here -- our team has had some encounters with it when it comes to bursty workloads. As it happens, we're capturing those internal metrics, though they looked kind of weird during the impact period. I've attached the metric as both a line graph and a bar graph, since even though a bar graph is technically not the "correct" type, it shows something which the line graph does not:

[Screenshot: vector.adaptive_concurrency_limit as a line graph]
[Screenshot: vector.adaptive_concurrency_limit as a bar graph]

vector.adaptive_concurrency_limit is usually pretty stable for our fleet. It fluctuates a bit, but not much. The ramp-up in the graph is anomalous and coincides with an API key being revoked. Halfway into its ascent is when Vector begins discarding events, and then a few minutes later Vector stops emitting the vector.adaptive_concurrency_limit metric. In fact, all ARC metrics disappear around this time (though Vector continues emitting other metrics). That's kind of neat but weird!

ARC metrics briefly return every time we cycle the Vector fleet. The concurrency limit ramps up, and then disappears. That's another interesting correlation: the Datadog sink briefly resumes working when the fleet is cycled, but only temporarily. I had guessed that this was a case of new workers initially not having the revoked API key in their buffer, but that was when my theory was still that Vector did not partition these in the sink.

In any case, please let me know if you'd like me to provide more details. Alternatively: I recognize that GitHub issues isn't really a place to submit support requests, and this no longer seems to be a bug report about the sink not partitioning events by API key. If Vector appears to be behaving as expected and this issue should be closed, then I completely understand.

But I love a good mystery, so if this is a thread you'd like to pull on I'm happy to help.

byronwolfman · Oct 2, 2024

Something else I forgot to mention because it is potentially important: we emit vector.* metrics out-of-band from the Datadog source and Datadog sinks, since it would be pretty embarrassing for those components to fail and for us to lose visibility. Internal vector.* metrics and logs travel through different components, which is why we didn't lose visibility into the fleet, but also why the missing vector.adaptive_concurrency_limit stuff is puzzling!

byronwolfman · Oct 2, 2024

> cc/ @bruceg does this sound plausible? I think this could be a deficiency of the ARC implementation, if so.

That sounds very plausible, yes, particularly if setting a fixed concurrency limit resolves the problem.

> ARC metrics briefly return every time we cycle the Vector fleet. The concurrency limit ramps up, and then disappears. That's another interesting correlation: the Datadog sink briefly resumes working when the fleet is cycled, but only temporarily. I had guessed that this was a case of new workers initially not having the revoked API key in their buffer, but that was when my theory was still that Vector did not partition these in the sink.

The ARC metrics are emitted as part of the process of issuing requests and handling their responses. When there are no requests, new values for the metrics stop being emitted.

bruceg · Oct 2, 2024

Thanks @bruceg . It sure looks like requests were sent, as vector.http_client_requests_sent_total showed continued (if somewhat depressed) outbound requests. Here's that graph overlaid on top of vector.http_client_responses_total by HTTP status; in both cases, the metric has been scoped to only the impacted Datadog Log sink:

[Screenshot: vector.http_client_requests_sent_total for the impacted Datadog Logs sink]
[Screenshot: vector.http_client_responses_total by HTTP status for the impacted Datadog Logs sink]

These are what ultimately led us to the "poison API key" theory, since all requests received HTTP 403, instead of only a portion.

I think we have some good info from this issue and some potential changes we can make on our side (e.g. static request concurrency). I'll put the ball back in your court: if this smells like a potential bug, I'm happy to leave the issue open. If this looks to you like Vector is acting as expected, I'm good to close it at this point.

byronwolfman · Oct 3, 2024

Thanks for the additional details @byronwolfman.

I'm curious to hear if setting a static concurrency resolves the issue for you. If it does, then I'm relatively confident that the invalid API key is causing ARC to limit concurrency in an unexpected and undesirable way, so I'm OK with leaving this open to resolve. I think we may need ARC to partition its state along the same partitioning used for the requests whose concurrency it is optimizing, in which case we can adapt this issue to track fixing that behavior.
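
As a purely hypothetical sketch of that idea (invented names, not Vector internals), the state would be keyed the same way the requests are partitioned:

use std::collections::HashMap;
use std::sync::Arc;

// Invented names for illustration only, not Vector internals: one
// concurrency controller per partition key (here, the stored Datadog
// API key) instead of a single controller shared by the whole sink.
struct Controller {
    limit: usize,
}

#[derive(Default)]
struct PerKeyConcurrency {
    controllers: HashMap<Option<Arc<str>>, Controller>,
}

impl PerKeyConcurrency {
    // Look up (or create) the controller for a given API key partition.
    fn controller_for(&mut self, api_key: Option<Arc<str>>) -> &mut Controller {
        self.controllers
            .entry(api_key)
            .or_insert(Controller { limit: 64 })
    }
}

fn main() {
    let mut limits = PerKeyConcurrency::default();
    // Repeated 403s collapse only the revoked key's limit...
    limits.controller_for(Some(Arc::from("revoked-key"))).limit = 1;
    // ...while every other partition keeps its full concurrency.
    assert_eq!(limits.controller_for(Some(Arc::from("good-key"))).limit, 64);
    assert_eq!(limits.controller_for(None).limit, 64);
}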

jszwedko · Oct 4, 2024

Sounds good! We have decided to do science: our team will be re-creating the incident scenario in a more isolated environment, this time using static concurrency. I will report back when we have results!

byronwolfman · Oct 4, 2024

Sorry to leave you hanging! I bring good news. @jszwedko's theory is correct: it was ARC. To test this, we recreated the same conditions that precipitated the incident but replaced ARC with request.concurrency = 10 in the sinks. When payloads were submitted using the bad/poison API key, those payloads got stuck as expected, but all other payloads submitted with valid Datadog API keys flowed through without any friction.

This means a few things:

  • The title of this issue is wrong!
  • But a bad API key can cause ARC to self-destruct

Given that, let me know if you'd like me to close the issue as being user error, keep it open but re-titled to track a potential bad behaviour in ARC, or something else.

byronwolfman · Nov 4, 2024

Thanks @byronwolfman ! It sounds like our suspicions were correct then. I created a separate issue to follow up on the fact that ARC maintains global state across "partition keys" like the Datadog API key here: https://github.com/vectordotdev/vector/issues/21725. I think we can close this one and use the other for tracking.

Appreciate you bringing this to our attention! For now, to avoid this kind of bottleneck, I'd recommend setting a static concurrency if you anticipate that some API keys may be invalid.

jszwedko · Nov 6, 2024

Re-opening this, but I can create a new issue if that's preferred.

I'm not sure if there's a regression or something else has changed, but we ran into this again recently: events started arriving with a bad/poison API key, which affected payloads submitted with good API keys.

Interestingly, the behaviour is not consistent between the log and metric sinks: when the poison API key was introduced, the log sink began complaining of blanket 403s, but the metric sink continued working just fine. Here's what those sinks currently look like:

[sinks.datadog_logs]
type                = "datadog_logs"
inputs              = ["terminal_datadog_logs"]
compression         = "gzip"
default_api_key     = "${DATADOG_FALLBACK_API_KEY:-none}"
healthcheck.enabled = false

# Apply backpressure to ensure log delivery
acknowledgements.enabled = true

# Use static concurrency so bad payloads don't cause Vector to scale concurrency to zero
request.concurrency = 10

[sinks.datadog_metrics]
type                = "datadog_metrics"
inputs              = ["terminal_datadog_metrics"]
default_api_key     = "${DATADOG_FALLBACK_API_KEY:-none}"
healthcheck.enabled = false

# Do not enable unless you have measured all round trip latency _very_ carefully
acknowledgements.enabled = false

# Use static concurrency so bad payloads don't cause Vector to scale concurrency to zero
request.concurrency = 10

Some interesting differences:

  • We use e2e acks on logs, but not metrics
  • Static request concurrency is used on both (no ARC)
  • gzip compression on logs, but not metrics -- shouldn't make a difference, maybe it does?

byronwolfman · Jun 30, 2025

Well, looks like I don't have permission to re-open. Let me know if you'd like me to open a new one, or if you're OK with re-opening this one yourself!

byronwolfman · Jun 30, 2025

Hi @byronwolfman, please create a new issue and supply the necessary context so we can take a fresh look. Thanks!

pront · Jun 30, 2025