
API -> DSD as default configuration

jeremy-lq opened this issue

Updating the default configuration to DogStatsD, which is more efficient and has other advantages.

jeremy-lq avatar Feb 24 '22 16:02 jeremy-lq

It's worth noting that @jeremy-lq is a member of the Datadog org.

Here's the context for the existing advice on that page: https://github.com/micrometer-metrics/micrometer-docs/pull/93#issuecomment-915734481

The StatsD approach requires publishing each measurement out of process to the StatsD daemon in approximately real time, whereas the API publishes once per step interval per time series. If you increment a counter 100 times per second, you need to publish 100 statsd lines per second for it. If you are using the API, you publish 1 time series per step interval for that counter regardless of how many times it is incremented during the step.
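For concreteness, here is a minimal sketch (not from the thread) of the two setups being compared: Micrometer's DatadogMeterRegistry for the API option, and the common StatsdMeterRegistry with the Datadog flavor for the DogStatsD option. The API key and all other settings shown are placeholders; anything not overridden falls back to Micrometer's defaults.

```java
import io.micrometer.core.instrument.Clock;
import io.micrometer.datadog.DatadogConfig;
import io.micrometer.datadog.DatadogMeterRegistry;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdFlavor;
import io.micrometer.statsd.StatsdMeterRegistry;

public class DatadogRegistries {

    // Option 1: publish aggregated metrics once per step interval via the Datadog HTTP API.
    static DatadogMeterRegistry apiRegistry() {
        DatadogConfig config = new DatadogConfig() {
            @Override
            public String apiKey() {
                return "YOUR_API_KEY"; // placeholder
            }

            @Override
            public String get(String key) {
                return null; // accept the remaining defaults (e.g. 1-minute step)
            }
        };
        return new DatadogMeterRegistry(config, Clock.SYSTEM);
    }

    // Option 2: publish StatsD lines (Datadog flavor) to a locally running Datadog Agent.
    static StatsdMeterRegistry statsdRegistry() {
        StatsdConfig config = new StatsdConfig() {
            @Override
            public StatsdFlavor flavor() {
                return StatsdFlavor.DATADOG;
            }

            @Override
            public String get(String key) {
                return null; // defaults: localhost:8125
            }
        };
        return new StatsdMeterRegistry(config, Clock.SYSTEM);
    }
}
```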

@jeremy-lq is the information in that thread correct/incorrect/misleading/incomplete?

rehevkor5 avatar Feb 24 '22 18:02 rehevkor5

👋 howdy :) While that's true in the default case, we have this wonderful doc on high-throughput metrics with multi-language (including Java) support: https://docs.datadoghq.com/developers/dogstatsd/high_throughput/ (see the Java tab for an example). In addition, communicating over UDS can potentially offer additional performance improvements.

What's happening for large deployments using this library with the direct-via-API option is poor behavior in the update-metric-metadata loop: it continuously retries, ignoring 4xx (404, 400, and 429) responses, which impacts customers' IP rate limits.

platinummonkey avatar Feb 24 '22 19:02 platinummonkey

Sorry for my confusion, but I'm not sure I get how publishing each measurement can be more efficient than sending an aggregate. As far as I understand, buffering will not decrease the number of measurements, and sampling is a trade-off between decreasing traffic and losing data (UDP is a similar trade-off between throughput and losing data).

Also, what are the other advantages?

jonatan-ivanov avatar Feb 25 '22 19:02 jonatan-ivanov

The agent already aggregates and reduces API calls. It also has additional logic to ensure data is delivered to our intake systems and properly retried. In addition, agents can be configured to dual-report metrics to multiple tenant instances, if the user desires, without sweeping code changes.

This isn't to say you couldn't still aggregate client side and push to the agent directly. This guide gives details on how the aggregation is performed to ensure compatibility, but this is also built into the StatsD client already: https://docs.datadoghq.com/developers/dogstatsd/high_throughput/#client-side-aggregation
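As a rough illustration of what the linked guide describes, here is a hedged sketch using the official java-dogstatsd-client. The enableAggregation/aggregationFlushInterval builder options are assumptions based on a recent (3.x) client version, so check the high-throughput guide for the exact option names.

```java
import com.timgroup.statsd.NonBlockingStatsDClientBuilder;
import com.timgroup.statsd.StatsDClient;

public class DogStatsdAggregationExample {
    public static void main(String[] args) {
        // Client-side aggregation buffers counts locally and flushes an aggregated
        // line to the local Agent, instead of one line per increment call.
        StatsDClient client = new NonBlockingStatsDClientBuilder()
                .prefix("myapp")
                .hostname("localhost")          // Datadog Agent's DogStatsD endpoint
                .port(8125)
                .enableAggregation(true)        // assumed builder option, see the high-throughput guide
                .aggregationFlushInterval(3000) // assumed option: flush roughly every 3 seconds
                .build();

        for (int i = 0; i < 100; i++) {
            client.incrementCounter("requests", "endpoint:checkout");
        }
        // With aggregation enabled, the 100 increments above are flushed as a single
        // aggregated count rather than 100 separate datagrams.

        client.stop();
    }
}
```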

Also, the agent intake has elevated limits, and payloads it accepts do not count against the User/Automation query limits on the general APIs currently being used by the API-method option.

platinummonkey avatar Feb 25 '22 19:02 platinummonkey

I suppose the other thing to consider when using DogStatsD is that you have to configure, deploy, and manage the Agent as a separate process, whereas the other option works directly in the app without the need to set up a separate process.

The weakness of some of those aggregations is that the data is not very robust to (undetectable?) data loss. For example, if you report the difference in count since last report, rather than the count itself, you lose data when you miss a sample for any reason instead of just losing resolution/granularity.

It continuously retries, ignoring 4xx (404, 400, and 429) responses, which impacts customers' IP rate limits.

It certainly seems like https://github.com/micrometer-metrics/micrometer/blob/main/implementations/micrometer-registry-datadog/src/main/java/io/micrometer/datadog/DatadogMeterRegistry.java#L133 needs to be updated with backoff logic based on the response headers explained here: https://docs.datadoghq.com/api/latest/rate-limits/. That wouldn't solve all problems with reporting directly to the API, but it could help a lot of people with rate-limiting problems.
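As a rough sketch of what such backoff could look like (this is a hypothetical helper, not Micrometer's actual HttpSender), honoring the X-RateLimit-Reset header documented on the rate-limits page before retrying a 429:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper for illustration only; not part of Micrometer.
public class RateLimitAwareSender {
    private final HttpClient client = HttpClient.newHttpClient();

    public HttpResponse<String> postWithBackoff(String url, String body, int maxRetries) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        for (int attempt = 0; ; attempt++) {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429 || attempt >= maxRetries) {
                return response; // success, non-retryable status, or retries exhausted
            }
            // X-RateLimit-Reset reports seconds until the limit resets (per the rate-limits doc);
            // fall back to a fixed delay if the header is missing.
            long waitSeconds = response.headers()
                    .firstValue("X-RateLimit-Reset")
                    .map(Long::parseLong)
                    .orElse(10L);
            Thread.sleep(Duration.ofSeconds(waitSeconds).toMillis());
        }
    }
}
```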

I can't find places where HttpSender is continuously retrying after 4xx responses. It seems to just log a message https://github.com/micrometer-metrics/micrometer/blob/main/implementations/micrometer-registry-datadog/src/main/java/io/micrometer/datadog/DatadogMeterRegistry.java#L138 and that's all. But I'm not very familiar with the code.

aggregate client side and push to the agent directly

It seems like StatsdCounter could be updated to be more similar to StatsdFunctionCounter. Specifically, it could become a StatsdPollable and report the count delta to its sink only when polled, rather than on every call to increment(). That would improve the efficiency of the dogstatsd approach within the app. Does that sound right, or is something preventing that? (We can move these implementation discussions over to https://github.com/micrometer-metrics/micrometer/ if that's more appropriate.)
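A hypothetical sketch of that idea (not Micrometer's actual StatsdCounter) might look like the following; the line template and sink names are invented for illustration:

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Consumer;

// Hypothetical sketch of the suggestion above: accumulate increments locally and
// emit a single delta line when polled, the way pollable meters do, instead of
// writing one StatsD line per increment() call.
public class PollableStatsdCounter {
    private final String lineTemplate;       // e.g. "myapp.requests:%d|c|#endpoint:checkout"
    private final Consumer<String> sink;     // wherever StatsD lines get written
    private final LongAdder pending = new LongAdder();

    public PollableStatsdCounter(String lineTemplate, Consumer<String> sink) {
        this.lineTemplate = lineTemplate;
        this.sink = sink;
    }

    public void increment(long amount) {
        pending.add(amount);                 // cheap in-process update, nothing sent yet
    }

    /** Called once per poll interval instead of once per increment. */
    public void poll() {
        long delta = pending.sumThenReset();
        if (delta > 0) {
            sink.accept(String.format(lineTemplate, delta)); // one StatsD line per interval
        }
    }
}
```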

rehevkor5 avatar Feb 26 '22 22:02 rehevkor5

@platinummonkey

The agent already aggregates and reduces API calls.

I guess that helps from the backend perspective, but if I understand it correctly, it does not help the client (the user's app) at all: they still need to send every data point separately, so I think I still don't get how publishing each measurement can be more efficient than sending an aggregate from the client's/user's/Micrometer's perspective.

This is basically where I would like to have some clarification:

The DogStatsD approach is far more efficient, especially if you have a high volume of metrics.

Since to me it seems that StatsD is far less efficient, especially if you have a high volume of metrics (please notice that Micrometer does not use the DataDog StatsDClient).

jonatan-ivanov avatar Mar 01 '22 00:03 jonatan-ivanov

It seems these performance tests are against Micrometer's own implementation of aggregation via custom API calls, vs. its own implementation of the StatsD line protocol that mimics some, but not all, of Datadog's official StatsD client.

please notice that Micrometer does not use the DataDog StatsDClient

Since this is a Datadog StatsD plugin, I'm confused why it wouldn't use the official Datadog Java StatsD client? The docs seem to imply multiple re-inventions of the official implementation details, but miss aggregation? The official Datadog StatsD client has client-side aggregation that is simple to enable. That aggregated client mode sends an aggregated submission to the local agent, not each individual point submission. The agent itself is quite lightweight even under heavy data load, and is already installed on the affected customers' machines to benefit from other observability data.

not needed on the classpath for this to work, as Micrometer uses its own implementation

Why was this needed? What was missing or didn't work?

The agent then does further aggregation where appropriate and uses an optimized submission payload against intake endpoints, including retry and backoff behavior, which results in stable delivery of points, logs, traces, and profiles. The agent itself is very lightweight and has the added benefit of also providing tracing, JVM profiling, logging, and a host of other features.

It seems Micrometer is trying to reinvent this logic, but only for metrics. At high load it is now impacting users' ability to use Datadog due to the non-optimized intake submission, infinite retry patterns for updating metric metadata on 404, and no backoff on 429 responses. So now not only will they not get their data, they won't be able to find out why either without intervention from Datadog Support.

platinummonkey avatar Mar 01 '22 15:03 platinummonkey

Since this is a Datadog StatsD plugin, I'm confused why it wouldn't use the official Datadog Java StatsD client?

Micrometer supports multiple backends that have StatsD support. In these cases, we use Micrometer's common StatsD registry and define a "flavor" that affects the naming convention and the format (line protocol) we send on the wire.

Please note that we don't have a Datadog StatsD registry. We have a common StatsD registry with a Datadog "flavor", and also a separate Datadog registry that uses the HTTP API.

Why was this needed? What was missing or didn't work?

I think this was implemented this way because Datadog was not the first registry that had StatsD support, and it was easier to add a Datadog flavor to the existing StatsD registry.

The agent then does further aggregation where appropriate and uses an optimized submission payload against intake endpoints, including retry and backoff behavior, which results in stable delivery of points, logs, traces, and profiles. The agent itself is very lightweight and has the added benefit of also providing tracing, JVM profiling, logging, and a host of other features.

I'm not sure I understand what the connection is between the agent and Micrometer. Micrometer does not know about the agent and should work without it. Btw, JVM agents do not work in native images.

It seems Micrometer is trying to reinvent this logic, but only for metrics. At high load it is now impacting users' ability to use Datadog due to the non-optimized intake submission, infinite retry patterns for updating metric metadata on 404, and no backoff on 429 responses. So now not only will they not get their data, they won't be able to find out why either without intervention from Datadog Support.

I guess Micrometer is just using pure StatsD, as it does for other registries; it does not reinvent anything. Quite the opposite: it uses the de-facto standard instead of a custom StatsD client.

Path forward: Since I still think that in our case StatsD is far less efficient, especially if you have a high volume of metrics, what do you think about:

  1. Closing this issue since what its description states does not seem to be true
  2. Opening an issue to explore the possibilities to use DataDog's StatsDClient
  3. Opening an issue to implement retry and backoff for our HTTP client (could be useful for other registries too)

jonatan-ivanov avatar Mar 02 '22 21:03 jonatan-ivanov

they still need to send every data point separately

At the risk of repeating myself, why not change StatsdCounter into a StatsdPollable so it reports the count delta to its sink only when polled, rather than on every call to increment()? I guess I will open an issue in the main repo to suggest that.

Btw JVM agents do not work in native images.

That is a good point. But I am not sure we are all talking about the same "agent". Sadly, there is a Datadog JVM agent described here: https://docs.datadoghq.com/tracing/setup_overview/setup/java/?tab=containers but there is also the "Datadog Agent". The terminology is indeed confusing. In general, if you are using Micrometer to send metrics to Datadog via StatsD, you are sending them to the Datadog Agent, which runs as a separate process. If you are using the Datadog JVM agent, you definitely must also run the Datadog Agent.

Datadog's documentation is somewhat inconsistent about how to send metrics from Java. This page says to expose metrics via JMX: https://docs.datadoghq.com/integrations/java/?tab=host However, these pages recommend DogStatsD instead, for whatever reason: https://docs.datadoghq.com/developers/dogstatsd/?tab=hostagent#pagetitle Also interesting to note that the page listing libraries, https://docs.datadoghq.com/developers/community/libraries/ , does not include Micrometer :)

Closing this issue since what its description states does not seem to be true

Maybe we can agree that it is debatable, as I think this thread proves. Let me add some color to the earlier statement that "At high load it is now impacting users' ability to use Datadog". Yes, with the current implementation there are some good things about reporting directly to the API. If you are a small user with a low rate of API requests, you probably won't have any problems. On the other hand, if you are a big user of Datadog with many different apps all following the advice on this page and sending Datadog metrics to the API, you may well end up with problems, and it will take a lot of work across your apps to fix them. You may start getting back 429 responses from API queries that were previously working, for reads as well as for sending metrics. For whatever reason, those responses might not even have a body or rate-limiting headers, so you can't do proper backoff. Also, the rate limiting appears to be at the account level, not the API-key or app-key level, so even if only one app is sending too rapidly, it can break all your other well-behaved apps.

Sadly, I see no information about that on https://docs.datadoghq.com/api/latest/metrics/#submit-metrics (linked from https://docs.datadoghq.com/metrics/custom_metrics/ ), and the info on https://docs.datadoghq.com/api/latest/rate-limits/ does not indicate that the metrics submission rate limit can impact clients that are not sending metrics.

Now, obviously these are all issues which should be dealt with better on the Datadog side. But for users of Micrometer, we should try to give them responsible advice based on the info we have available. Personally, I would not in good conscience recommend to a friend or coworker that everyone use API based reporting simply because it might be more efficient.

Therefore, perhaps a more nuanced explanation is needed, so that we can agree on a change that is helpful to the community? Specifically, we could mention that while there are some efficiency benefits to API-based reporting, you need to consider the possibility of hitting account-level API rate limits if you report metrics directly to the API instead of via StatsD and the Datadog Agent.

rehevkor5 avatar Mar 04 '22 21:03 rehevkor5

I can't contribute to the efficiency debate here (there seem to be reasonable arguments on both sides), but as a user who recently ran into trouble due to using Micrometer in the default configuration and some unfortunate circumstances, I would like to add my two cents about the documentation aspect.

With respect to the sentence "The API approach is far more efficient if you need to choose between the two.": it would be nice if the docs could be a bit more elaborate here. For example, from the comments above I understand that the efficiency concern is mainly about "high volume of metrics" situations (because of the IPC involved and metrics not being batched), and that would be helpful information for trade-off decisions. Even better if it could somehow be quantified what "high volume" means and what problems a user can expect.

(On a slightly related side note: since the DogStatsD approach supports distribution metrics and the API approach currently doesn't, it's pretty easy to get into the situation of having to choose between the two :)).

aptituz avatar Mar 21 '22 15:03 aptituz

I've submitted https://github.com/micrometer-metrics/micrometer/pull/3329 to address the technical implementation. Will circle back on the docs once that is resolved.

platinummonkey avatar Aug 03 '22 15:08 platinummonkey

Hey folks, here we are over a year later 🎂. The https://github.com/micrometer-metrics/micrometer/pull/3329 PR is still open and looking for reviews.

We're continuing to see many customers run into issues of infinite spam from the API-driven integration, compared to the agent-based StatsD reporting.

As an additional note: the API-driven approach appears to be lossy under these circumstances :warning: Also, in some cases customers have limited NAT gateways, which results in much of their application experience being blocked as rate limits are reached :warning:

platinummonkey avatar Jul 24 '23 13:07 platinummonkey

See https://github.com/micrometer-metrics/micrometer/pull/4283

platinummonkey avatar Oct 26 '23 19:10 platinummonkey

New docs hosted by DataDog are here https://docs.datadoghq.com/metrics/guide/micrometer/

platinummonkey avatar Nov 14 '23 13:11 platinummonkey

Let me close this since we moved the docs to their respective projects, and we are still planning to look into using DogStatsD in the DatadogMeterRegistry sometime later (as far as I remember, there were some issues in that implementation that should be fixed). After that we can call that out in our docs.

jonatan-ivanov avatar Jan 23 '24 22:01 jonatan-ivanov

Let me close this since we moved the docs to their respective projects, and we are still planning to look into using DogStatsD in the DatadogMeterRegistry sometime later (as far as I remember, there were some issues in that implementation that should be fixed). After that we can call that out in our docs.

Given that perfectly compatible Prometheus and OTel plugins exist in Micrometer as alternatives, this documentation link ( https://docs.datadoghq.com/metrics/guide/micrometer/ ) will remain the source of truth for Datadog and will be our first set of guided instructions for anyone running into the documented issues linked here.

The suggested PRs haven’t been accepted and are now closed but not solved.

platinummonkey avatar Jan 24 '24 10:01 platinummonkey

I created https://github.com/micrometer-metrics/micrometer/pull/4871 to try to include a note for the Datadog API rate limit problem that has been discussed here.

izeye avatar Mar 24 '24 16:03 izeye