Reporter for OTel Metrics
Description
Agents provide a metrics reporter that sends metrics from the metric registry provided by the OTel metrics SDK to APM Server.
The specifics of the implementation may be agent-specific, but the goal is that users have as little to configure as possible.
Unlike our OTel tracing bridge, metrics will not be implemented as a bridge that translates OTel API calls into calls to the internal metrics registry. Instead, we rely on the OTel metrics SDK to provide the implementation of the OTel metrics API. Agents will only provide a custom reporter that may be registered automatically (for languages that allow instrumentation) or programmatically.
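For illustration, here is a minimal sketch of what registering such a reporter programmatically could look like with the OTel JS metrics SDK (`@opentelemetry/sdk-metrics`). The `ElasticApmMetricExporter` name is hypothetical; an agent's real reporter would live inside the agent:

```js
// Minimal sketch, assuming the OTel JS metrics SDK.
// ElasticApmMetricExporter is a hypothetical name, not an existing class.
const { MeterProvider, PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { ExportResultCode } = require('@opentelemetry/core');

class ElasticApmMetricExporter {
  // PushMetricExporter interface: called periodically with the collected ResourceMetrics.
  export(resourceMetrics, resultCallback) {
    // Convert and send to APM Server here -- via intake-v2 or OTLP, the open question below.
    resultCallback({ code: ExportResultCode.SUCCESS });
  }
  forceFlush() { return Promise.resolve(); }
  shutdown() { return Promise.resolve(); }
}

const meterProvider = new MeterProvider();
meterProvider.addMetricReader(new PeriodicExportingMetricReader({
  exporter: new ElasticApmMetricExporter(),
  exportIntervalMillis: 30000,
}));
```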
Spec Issue
- [ ] https://github.com/elastic/apm/issues/692
Agent Issues
- [ ] https://github.com/elastic/apm-agent-java/issues/2810
- [ ] https://github.com/elastic/apm-agent-dotnet/issues/1843
- [ ] https://github.com/elastic/apm-agent-nodejs/issues/2954
- [ ] https://github.com/elastic/apm-agent-python/issues/1649
- [ ] https://github.com/elastic/apm-agent-go/issues/1328
@AlexanderWert (or anyone) naive question -- is the intention here that a user will add the OpenTelemetry metrics API/SDK to their project, use that to generate metrics, and then:

- the Metrics Reporter will report those metrics to the configured APM Server using APM Server's OTLP endpoint? Or,
- the Metrics Reporter will report those metrics to APM Server using the traditional metricset endpoint?
My understanding is that if our agent is installed, it will intercept the OTel metrics and send them using the traditional endpoint - but you raise a great question - what stops us from just configuring the OTel SDK to use the OTLP endpoint? Seems like that's a valid solution and much quicker to implement?
A couple of the benefits of using the intake/v2 protocol:
- Consistent metadata (global labels, container ID, cloud region, etc.)
- A single connection to APM Server
A single connection could also be achieved by using OTLP/HTTP. To get consistent metadata while using OTLP, we would need to implement https://github.com/elastic/apm-dev/issues/769#issuecomment-1226839134
Thanks for the insight both @axw, @jackshirazi.
Follow-up question for anyone -- OTel has six metric types (Counter, Asynchronous Counter, Histogram, Asynchronous Gauge, UpDownCounter, Asynchronous UpDownCounter). Has anyone done the work yet to map how these data types would be represented in the intake/v2 protocol?
> Has anyone done the work yet to map how these data types would be represented in the intake/v2 protocol?
I assume that work is effectively encoded in APM Server's support for OTLP incoming metrics, but I haven't looked.
Here is the limited (only tried with a gauge so far) code from my past OnWeek doing this for the node.js APM agent: https://github.com/elastic/apm-agent-nodejs/blob/trentm/onweek3-rearch-metrics/lib/metrics2/index.js#L87-L91
Basically I'm assuming/hoping:
- for counters and gauges it is a simple mapping of OTel `MetricData.dataPoints[].value` to the intake/v2 API's `samples[].value`; and
- for histograms that the OTel metric data maps one-to-one to intake/v2's `samples[].counts` and `samples[].values`.
I haven't sanity checked any of that though.
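For concreteness, a rough sketch of that assumed mapping for a counter/gauge data point (the metricset field names follow my reading of the intake-v2 schema; unverified):

```js
// Rough, unverified sketch of the assumed OTel -> intake-v2 metricset mapping.
const { hrTimeToMicroseconds } = require('@opentelemetry/core');

function dataPointToMetricset(metricData, dataPoint) {
  // For histograms, the hope above is that dataPoint.value.buckets maps to
  // samples[name].counts and samples[name].values -- also unverified.
  return {
    metricset: {
      timestamp: hrTimeToMicroseconds(dataPoint.endTime),
      samples: {
        [metricData.descriptor.name]: { value: dataPoint.value }, // counters and gauges
      },
      tags: dataPoint.attributes,
    },
  };
}
```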
A lingering concern/TODO that I had from my OnWeek trial usage of the OTel JS Metrics SDK was what, if any, implications there are from temporality, resets, and gaps in the OTel Metrics data model. I'm very naive here.
> for counters and gauges it is a simple mapping of OTel `MetricData.dataPoints[].value` to the intake/v2 API's `samples[].value`; and
I think this is fine for now, but we should eventually send the type (counter/gauge) too.
> for histograms that the OTel metric data maps one-to-one to intake/v2's `samples[].counts` and `samples[].values`.
There are two types of histograms: plain old histogram and exponential histogram.
The more I think about this, the more I think that sending the metrics as OTLP/HTTP and having the server combine it with metadata would be the way to go. That may also not be that simple for agents though, as it would involve changing the way requests are sent (as multipart content, with one part being metadata and one part being the OTLP protobuf).
> That may also not be that simple for agents though, as it would involve changing the way requests are sent (as multipart content, with one part being metadata and one part being the OTLP protobuf).
Would it be possible (and maybe easier for agents) to "just" enrich the metrics with metadata on the agent-side and then reuse the OTLP reporter (or some modified version of it)?
Though it would not solve the problem of having an additional connection
> Would it be possible (and maybe easier for agents) to "just" enrich the metrics with metadata on the agent-side and then reuse the OTLP reporter (or some modified version of it)?
It would be possible. That would involve translating our metadata to OTel resource attributes. Maybe it's not too bad?
> Though it would not solve the problem of having an additional connection

I suppose this bit is language-dependent. For Go I expect we can pass in a `net/http.Client`, so they would use the same connection pool. For HTTP/2 connections there'll be multiplexing over the same connection, but there may still be multiple connections due to the long-lived nature of our streaming requests.
Excellent question, Alan, and sorry for jumping in late. The intention of the pitch was to not be too prescriptive about the solution and more about the desired end state, which is why this issue may feel a bit under-defined. The intention is to explore the different approaches within the scope of this task and to pave the way for the rest of the agents. I think what got lost in converting the pitch into implementation issues is that for 8.6, we're just planning a POC, and don't necessarily expect to be able to ship a production-ready feature. If we can ship something experimental, that'd be ideal, but even that is not a success criterion for a POC. We're working on a more formal definition of what "POC" actually means.
But now is definitely the right time to talk about the different options and their trade-offs. Initially, I was thinking that we'd convert the metrics to the intake v2 format. I'm sure that there are some missing pieces and that we'll need to extend the schema to be able to capture all types of metrics.
Let me try to summarize the pros and cons:
Send OTel metrics via Intake v2
- Pros
  - It will at least be easier to use a single connection to APM Server and to have consistent metadata
  - No additional dependencies
  - Can re-use the same reporting infrastructure as the existing metrics support
- Cons
  - JSON is not the most efficient format to send metrics
  - There may be some missing bits in the intake v2 metrics schema
Send OTel metrics via OTLP
- Pros
  - OTLP may be a bit more space-efficient on the wire
  - We could potentially re-use some existing code (unless we need to customize the OTLP reporter)
- Cons
  - Potentially adds additional dependencies (such as gRPC and the OTel metrics SDK) to the core of the agent
  - Potentially additional connections to APM server
  - Re-using the same metadata is trickier: either all agents need to do the mapping from intake v2 metadata to OTel semantic attributes, or we could separate metadata and metrics in a multipart request; but that probably means we need to write a custom exporter.
I'm still leaning towards sending OTel metrics via intake v2, but there are lots of unknowns on both sides.
> Would it be possible (and maybe easier for agents) to "just" enrich the metrics with metadata on the agent-side and then reuse the OTLP reporter (or some modified version of it)?
>
> It would be possible. That would involve translating our metadata to OTel resource attributes. Maybe it's not too bad?
@axw Is there code in apm-server that is doing the reverse of this (translating OTel resource attributes into our metadata) to support OTLP intake? I'm starting to look at the Node.js agent code for this PoC and would be interested in cribbing from that code if it exists.
@trentm yes, it's here: https://github.com/elastic/apm-server/blob/main/internal/processor/otel/metadata.go
As an alternative to passing metadata via a multipart message, what about passing the whole metadata JSON as a single resource attribute -- named `elastic_apm_metadata`, for example? (Update: We'd need to modify APM server's OTLP resource attribute code to handle that "special" attribute, of course.)
I'm able to do this with the OTel JS metrics exporter code easily. Here is a pretty-printed protobuf example (sent from the Node.js agent) showing this working: https://gist.github.com/trentm/c93951b4c163b49a1776584adc5ab3c3#file-metrics-data-out-L122-L127 And here is an example showing it working using OTLP/HTTP+JSON: https://gist.github.com/trentm/acb499092c1ebaa79e3ad835095793dc#file-example-otlp-http-json-request-json-L24-L29
I suspect it'll work for OTLP/gRPC too, though I haven't tried.
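For reference, setting that attribute with the OTel JS SDK is just a matter of putting it on the `Resource` (a sketch; the attribute name is this proposal's, not an existing convention):

```js
// Sketch: passing the agent's full intake-v2 metadata as a single resource attribute.
const { Resource } = require('@opentelemetry/resources');
const { MeterProvider } = require('@opentelemetry/sdk-metrics');

const apmMetadata = { service: { name: 'my-service' } }; // placeholder for the agent's full metadata object
const meterProvider = new MeterProvider({
  resource: new Resource({
    elastic_apm_metadata: JSON.stringify(apmMetadata), // attribute values must be primitives, hence the stringify
  }),
});
```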
OpenTelemetry does describe configurable limits on attributes (https://opentelemetry.io/docs/reference/specification/common/#attribute-limits):
> ... AttributeCountLimit (Default=128) - Maximum allowed attribute count per record; AttributeValueLengthLimit (Default=Infinity) - Maximum allowed attribute value length; ...
However, resource attributes are exempt:
> Resource attributes SHOULD be exempt from the limits described above as resources are not susceptible to the scenarios (auto-instrumentation) that result in excessive attributes count or size. Resources are also sent only once per batch instead of per span so it is relatively cheaper to have more/larger attributes on them. ...
What do others think about this way of getting metadata from APM agents to APM server via OTLP? At least for the JavaScript OTel SDK, this will be much easier than digging into the sending code to make multipart requests.
[AlexW]
> Though it [having the APM agents use OTLP to send metrics] would not solve the problem of having an additional connection

[Andrew]
> I suppose this bit is language-dependent. For Go I expect we can pass in a `net/http.Client`, so they would use the same connection pool. For HTTP/2 connections there'll be multiplexing over the same connection, but there may still be multiple connections due to the long-lived nature of our streaming requests.
I investigated this for the Node.js agent here: https://github.com/elastic/apm-agent-nodejs/issues/2954#issuecomment-1302592624 tl;dr: I am able to get the OTel Metrics OTLP requests to share the APM agent's connection pool. However, because of our long-lived intake-v2 requests, there is still a separate socket connection for the OTLP requests.
And then after the fact I realized that Andrew had already pointed this out: "but there may still be multiple connections due to the long-lived nature of our streaming requests".
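For reference, the exporter-side knobs in the OTel JS OTLP/HTTP exporter look roughly like this (a sketch; actually sharing the agent's pool is the more involved part covered in the linked issue):

```js
// Sketch: pointing the OTel JS OTLP/HTTP+Protobuf metrics exporter at APM
// Server's OTLP endpoint, with keep-alive so sockets can be reused.
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-proto');

const exporter = new OTLPMetricExporter({
  url: 'http://localhost:8200/v1/metrics', // APM Server's OTLP/HTTP metrics endpoint
  headers: { Authorization: 'Bearer ***' },
  keepAlive: true,
  httpAgentOptions: { keepAlive: true },
});
```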
@trentm nice idea. Sending the metadata as a resource attribute is certainly a lot simpler than what I had in mind.
Regarding multiple connections: I think I missed some words before, and that should only apply to HTTP/1.1. In HTTP/2, the requests would be multiplexed as multiple streams.
Would it be feasible for the Node.js agent to use the http2 module's Compatibility API for making requests to APM Server? If so, then for TLS connections to APM Server (e.g. in Elastic Cloud), the agent should be able to negotiate HTTP/2 and minimise the number of connections.
> Would it be feasible for the Node.js agent to use the http2 module's Compatibility API for making requests to APM Server?
My understanding is that the Compatibility API is about the server side -- i.e. supporting creation of an HTTP server in node that can handle incoming HTTP/1.1 and HTTP/2 requests.
But, yes, I can look into getting the Node.js agent to use HTTP/2 for its requests. Going this route will potentially be a lot more work:
- Each of the APM agents would potentially have to do some re-work of their clients to use HTTP/2.
- At least in the JS OTel SDK, the OTLP/HTTP requests are currently using HTTP/1.1. I'll have to look into that code to see if it will be easy enough to replace its lower-level sending code to use HTTP/2 as well -- and to share the connection with the APM agent (including through connection errors/recycling).
Showing my HTTP/2 newb-ness, TIL about "GOAWAY" frames.
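For context, a bare-bones HTTP/2 client request with Node's core `http2` module looks something like this (a sketch of the style of API the agent's intake client would need to adopt; the host and payload are placeholders):

```js
// Minimal sketch of a Node.js HTTP/2 client request (node core http2 module).
const http2 = require('http2');

const ndjsonPayload = '{"metadata":{}}\n'; // placeholder for the agent's ndjson event stream
const client = http2.connect('https://apm.example.com:8200');
const req = client.request({
  ':method': 'POST',
  ':path': '/intake/v2/events',
  'content-type': 'application/x-ndjson',
});
req.on('response', (headers) => console.log('status:', headers[':status']));
req.on('close', () => client.close());
req.end(ndjsonPayload);
```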
Here is a bit of a status update. I haven't looked at this for a week and a half.
The "easy" path so far, from my investigation, is as follows. This requires very little work on the Node.js APM agent side:
- using OTLP/gRPC or OTLP/HTTP+Protobuf
- add an `elastic_apm_metadata` resource attribute with all the APM metadata (as mentioned above)
- work on APM server to support parsing and using the `elastic_apm_metadata` resource attribute
- either accepting the extra connections from APM agents, or looking into getting APM agents to use HTTP/2 so intake-v2 and OTLP requests can be multiplexed.
Here are the open questions I intended to look into. Some of these are language-specific, some not.
To use the intake-v2 API for sending OTel Metrics:
- How hard is it to have agents convert the data sent to OTel's Push Metric Exporter -- a `ResourceMetrics`?
  - I have a start at this from an earlier OnWeek.
- Is intake-v2 missing some support for OTel metric types? My understanding from @axw was that exponential histograms might be the only missing type.
- Measure bandwidth to APM server to compare space efficiency. Unless it is waaay larger than OTLP/gRPC (after compression), then this is a non-issue, I would think.
To use one of the OTLP flavours:
- The simple "use OTLP for sending metrics" approach -- re-using OTel SDK exporters mostly as is -- will mean an additional connection to APM server from APM agents. Is the number of connections to APM server a real concern?
- If additional connections is a concern, how hard would it be to get each APM agent to (a) use HTTP/2 for its intake-v2 usage, and (b) share its connection pool with the OTel SDK's OTLP sending? This may differ for OTLP/HTTP vs OTLP/gRPC.
- Can each APM agent easily add an `elastic_apm_metadata` resource attribute, with the full APM metadata, to sent OTLP data? In the OTel JS SDK this is done by adding it to the `Resource` passed to the `MeterProvider`.
- If so, is there a concern with the delay in determining cloud metadata or lambda? At least in the OTel JS SDK, the `MeterProvider` does not support adding resource attributes after creation. I found a workaround for the OTel JS SDK, but that may not be possible for other languages. I discuss this issue and the workaround here: https://github.com/elastic/apm-agent-nodejs/issues/2954#issuecomment-1302584123
- How large of a dep is adding OTLP/gRPC support? OTLP/HTTP+Protobuf? OTLP/HTTP+JSON?
@JonasKunz I understand that you are starting to look at this PoC for the Java agent as well. Let me know if the above makes sense and/or if there is anything we could work on together.
> Would it be feasible for the Node.js agent to use the http2 module's Compatibility API for making requests to APM Server? If so, then for TLS connections to APM Server (e.g. in Elastic Cloud), the agent should be able to negotiate HTTP/2 and minimise the number of connections.
@axw I started looking into this. Correct me if I'm wrong: APM server itself does not support HTTP/2, but the cloud proxy does? Or at least APM server does not when I'm accessing it via `http://localhost:8200/...` in a local Tilt setup. (Details of the `curl ...` examples I was using to get to this conclusion are below.) Some open questions about how HTTP/1.1 - HTTP/2 negotiation might work:
- Perhaps for the cases where agents are talking to APM server over "http" -- dev, testing, APM server on edge(?) -- there isn't any concern with "too many connections", so we'd not bother with HTTP/2?
- When an APM agent is talking to APM server over "https", it probably cannot assume HTTP/2 support. I'm not super confident in Node.js client libraries making HTTP/1.1 -> HTTP/2 ALPN negotiation easy (I might be wrong; a probe sketch follows this list). An alternative might be to use the APM server version check request (`GET /`) to see if APM server supports HTTP/2, and if so, use HTTP/2 for subsequent intake requests.
- Would we need/want to work on the Lambda extension using HTTP/2 and proxying OTLP connections?
Local APM server (running via Tilt) failing on an attempted HTTP/2 request:
% curl -i http://localhost:8200/intake/v2/events -X POST --data-binary @./payload.ndjson -H content-type:application/x-ndjson --http2-prior-knowledge
curl: (92) HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)
An APM server in cloud supporting HTTP/2 via ALPN negotiation (use `-v` to see the ALPN negotiation):
% curl -i https://my-deployment-31a70c.apm.us-west2.gcp.elastic-cloud.com/intake/v2/events -X POST --data-binary @./payload.ndjson -H content-type:application/x-ndjson -H 'Authorization: Bearer ***'
HTTP/2 202
date: Wed, 16 Nov 2022 21:19:34 GMT
x-cloud-request-id: wbYJXidQQUSPwyXCzbHBRQ
x-content-type-options: nosniff
x-found-handling-cluster: cdfdba50f7a44a32981916b0faf9a7a2
x-found-handling-instance: instance-0000000000
content-length: 0
> Correct me if I'm wrong: APM server itself does not support HTTP/2, but the cloud proxy does?
I found https://github.com/elastic/apm-server/blob/main/dev_docs/otel.md#muxing-grpc-and-http11 and I suspect I'm hitting this:
> For h2c, gmux assumes all h2c requests are for gRPC and sends them on to the gRPC server. We could also perform the Content-Type header check there, but we do not support h2c apart from gRPC.
I was attempting to use h2c for non-gRPC.
Crazy idea: what about going gRPC for intake-v2 data?
@trentm as you've found, we do support HTTP/2 but it more or less requires TLS.
> Perhaps for the cases where agents are talking to APM server over "http" -- dev, testing, APM server on edge(?) -- there isn't any concern with "too many connections" so we'd not bother with HTTP/2?
I think that's fair to say.
> When an APM agent is talking to APM server over "https", it probably cannot assume HTTP/2 support. I'm not super confident in Node.js client libraries making HTTP/1.1 -> HTTP/2 ALPN negotiation easy (I might be wrong). An alternative might be to use the APM server version check request (`GET /`) to see if APM server supports HTTP/2, and if so, use HTTP/2 for subsequent intake requests.
It can't assume HTTP/2 support, e.g. because there could be a reverse proxy. ALPN is the expected way of dealing with this, but I can't comment on Node.js support.
> Would we need/want to work on the Lambda extension using HTTP/2 and proxying OTLP connections?
Good point, I hadn't thought about the Lambda extension. It doesn't currently support OTLP at all. That's another con for the OTLP approach/pro for intake-v2.
> Crazy idea: what about going gRPC for intake-v2 data?
@graphaelli looked into this a couple of years ago, and I think @marclop may have looked at it recently too. IIRC, one issue we found is that protobuf is slower to encode in Node.js than JSON, by virtue of the runtime having native JSON encoding support. I don't know if that has changed. If not, it's another con for the OTLP approach for metrics -- but probably not such a big deal if limited to metrics, which would not be high throughput.
> one issue we found is that protobuf is slower to encode in Node.js than JSON
That's probably still true. I have a note to do some CPU load comparisons.
I might be a little late to the party, but I started investigating things from the Java side yesterday.
I'm trying the "easy path" as a starter as well: use the OTel metrics SDK + OTLP exporter. Though my PoC is not working yet because I'm still fighting the classloading, here are some early findings:
Dependency Sizes
- The otel-metrics-sdk itself comes with no transitive dependencies and only a handful of classes. Using it vs. implementing the OTel metrics API ourselves wouldn't really make a difference in terms of agent binary size.
- The OTLP exporters all come in a single package. They pull in `OkHttp` and the Kotlin standard libraries as well. This increases the agent binary size from ~10 MB to ~13.5 MB.
HTTP/2 communication
- The Elastic agent currently uses the ancient `HttpUrlConnection` API for communication with the APM server because it comes without additional dependencies and is available in Java 7/8. To my knowledge this API does not support HTTP/2.
- To get both our Intake-API connection and the OTLP exporter running on the same TCP connection, we would need to run both with the same `OkHttpClient` instance:
  - We would need to rewrite our agent-core to use `OkHttp` for the Intake-API communication
  - We would need to supply the shared `OkHttpClient` to the OTLP exporter: their API currently does not allow this!

So to summarize, it seems very hard to get both exporters running on the same TCP connection via HTTP/2.
I was therefore thinking of the same middle ground between converting the data to IntakeV2 and sending the data to the APM server's OTLP endpoint:
> Crazy idea: what about going gRPC for intake-v2 data?
I was thinking of just sending the OTLP protobuf messages via the IntakeV2 API (via something like an `otlp` event type).
For binary protobufs we could encode the binary messages as base64 strings, which should be reasonably efficient on the wire after compression.
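Roughly, each such intake event could look like this (sketched in JS for consistency with the other examples; the `otlp` event type and its fields are hypothetical, not part of today's intake-v2 schema):

```js
// Hypothetical "otlp" intake-v2 event wrapping a base64-encoded, serialized
// ExportMetricsServiceRequest protobuf message.
function toOtlpIntakeLine(serializedRequest /* Buffer */) {
  return JSON.stringify({
    otlp: {
      content_type: 'application/x-protobuf',
      payload: serializedRequest.toString('base64'),
    },
  }) + '\n';
}
```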
> one issue we found is that protobuf is slower to encode in Node.js than JSON
I stumbled across the fact that there is a JSON protobuf encoding, though it is experimental and I haven't looked deeper into it yet. Is this maybe an option?
> I stumbled across the fact that there is a JSON protobuf encoding, though it is experimental and I haven't looked deeper into it yet. Is this maybe an option?
Yah, that is a possible option. I can access that code in the OTel Metrics SDK without cheating. It would perhaps be sketchy to rely on the stability of this JSON encoding until it is "stable".
The proposal here, then, might be:
- use OTel Metrics SDK to gather metrics and convert to its Protobuf-JSON encoding
- send those via intake-v2 as an alternative to "metricset" objects (call them "otelresourcemetrics" intake events, say)
- update APM server's intake-v2 handling to accept those "otelresourcemetrics", hopefully re-using some OTLP support code
This would mean:
- the (Node.js) APM agent wouldn't have to know how to convert between OTel and our metrics formats;
- conversion performance for the APM agent would presumably be better (native `JSON.stringify` over protobufjs code);
- re-use of the intake-v2 connection;
- the size overhead for the Node.js APM agent is much reduced (the largest hit is from the Protobuf exporters);
- APM server bears the brunt of the work (adding "otelresourcemetrics" support to intake-v2)
Update: For users, it means they need to have an updated agent and APM server and the agent cannot send those "otelresourcemetrics" until it has successfully sniffed that the APM server version is new enough. If the APM agent converted to "metricsets" and used intake-v2, then the user just needs an updated agent version -- which is slightly nicer.
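In code, the per-batch work for the agent would then be roughly this (a sketch; `otelresourcemetrics` is the proposed, not an existing, intake-v2 event type):

```js
// Sketch: wrapping OTLP's (experimental) JSON protobuf encoding of one
// ResourceMetrics batch as a proposed "otelresourcemetrics" intake-v2 event.
function toIntakeLine(resourceMetricsJson) {
  return JSON.stringify({ otelresourcemetrics: resourceMetricsJson }) + '\n';
}
```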
Encoding OTel metrics inside intake-v2 is an interesting idea. Given that the JSON encoding is experimental, I'm a bit leery of depending on it. Base64-encoded protobuf feels like it could be a bit of a pain for debugging, but it's technically fine and should be more stable.