opamp-spec icon indicating copy to clipboard operation
opamp-spec copied to clipboard

Decide if the protocol can be used to manage SDKs

Open tigrannajaryan opened this issue 4 years ago • 12 comments

Can we use the same protocol to manage Otel SDKs?

Currently applicable parts:

  • Remote Configuration Management (yes)
  • Status Updates (unclear)
  • Addons (no)
  • Connection Settings Managements (maybe)
  • Addons (no)
  • Agent package update (no)

tigrannajaryan avatar Nov 09 '21 21:11 tigrannajaryan

One thing that we will need to change if we want to manage SDK is the following current spec requirements:

The Resource in the reported telemetry SHOULD describe the Agent in the following way:

  • service.instance.id attribute SHOULD be set to Agent's instance_uid that is used in the OpAMP messages.
  • service.name attribute SHOULD be set to Agent's type, matching the value of agent_type field in AgentDescription message.
  • service.version SHOULD be set to Agent's version, matching the value of agent_version field in AgentDescription message.

These are written with the assumption that the Agent is a standalone executable and we can call it a "Service".

However this is not true for embedded agents, such as Otel SDK. Calling an Otel SDK a "Service" and requiring it to output its own metrics with "service.*" attributes would be very confusing. For an Otel SDK a more natural way would be to emit its own metrics for example with the following attributes:

  • telemetry.sdk.name (this already exists as Otel semantic convention).
  • telemetry.sdk.version (we can add this as a new Otel semantic convention).
  • telemetry.sdk.instance.id (we can add this as a new Otel semantic convention and require it to be globally unique, so technically this one alone is sufficient as an identifying attribute).

To make such usage possible with OpAMP we need to modify OpAMP spec in the following way (this is just one possible way, there are probably others):

  • In AgentDescription message record identifying and non-identifying attributes separately. For example for a standalone Agent the following would be identifying attributes: service.instance.id, service.name, service.version, while e.g. os.type would be non-identifying.
  • For Own Telemetry Reporting instead of requiring the hard-coded list of attributes (service.instance.id, service.name, service.version), instead require that the identifying attributes that are included in the AgentDescription message MUST be also used as attributes in own telemetry. This gives the flexibility for the agents to use any attributes they want to identify themselves, yet still clearly establishes the link between OpAMP's Agent and its own telemetry collected elsewhere.

@open-telemetry/opamp-spec-approvers what do you think?

tigrannajaryan avatar Nov 30 '21 22:11 tigrannajaryan

@MrAlias @jmacd what do you think about the above comment? Would it make sense for Otel SDK to follow the conventions like that and use OpAMP to receive the config?

tigrannajaryan avatar Dec 01 '21 23:12 tigrannajaryan

@tigrannajaryan Think this should be brought to the maintainers call, just to get an initial vibe from the community? Or too early perhaps?

carlosalberto avatar Dec 02 '21 06:12 carlosalberto

@carlosalberto it is probably early for detailed discussion but it is not a bad idea to create more awareness at least. Let's mention this in the next maintainers call.

tigrannajaryan avatar Dec 02 '21 17:12 tigrannajaryan

@open-telemetry/opamp-go-approvers I need your help on this. Please comment and tell what you think.

I submitted a draft PR which shows the relevant changes to allow OpAMP to work with SDKs: https://github.com/open-telemetry/opamp-spec/pull/48

tigrannajaryan avatar Dec 06 '21 15:12 tigrannajaryan

This is interesting idea. I am somewhat concerned that it will increase the complexity, but perhaps it's worth it. It would make management of common parameters (e.g. sampling rate, enabled/disabled instrumentations, etc.) much more convenient

pmm-sumo avatar Dec 06 '21 16:12 pmm-sumo

This is interesting idea. I am somewhat concerned that it will increase the complexity, but perhaps it's worth it.

To understand how much complicated this is I modified my private prototype implementation according to this change and found that it increases the implementation complexity only a little bit. I ended up changing less than a 100 lines of code combined in the client and server implementation (compared to the implementation that uses a fixed agenttype/agentversion pair of fields).

tigrannajaryan avatar Dec 06 '21 16:12 tigrannajaryan

I think we absolutely need a standard configuration format for SDKs and I think that once we have that it would be nice to be able to manage that configuration through the same remote management system as the collector.

Aneurysm9 avatar Dec 08 '21 20:12 Aneurysm9

I support the proposals in https://github.com/open-telemetry/opamp-spec/issues/16#issuecomment-983077725 that, roughly speaking, accepts OpAMP as a way to manage client SDKs by adding support for variable telemetry "schemas". Could we use the schema_url aspect of the OTel client configuration as a way to describe the agent's telemetry schema, too?

Status Updates (unclear)

About status updates, this somehow tickles the idea that we could use one of the existing data (OTLP logs, metrics, or spans) to convey status. I went through an exercise of making a "very observable" agent last year, using OTLP for https://github.com/lightstep/opentelemetry-prometheus-sidecar#diagnostics.

In that package, I used an OTLP span to convey a status report. Could a telemetry schema be developed for status reports, instead of a dedicated field? Maybe this is a good-in-theory, bad-in-practice suggestion, but may I ask -- what benefit is there in having status reports be a new data type?

jmacd avatar Dec 10 '21 20:12 jmacd

In that package, I used an OTLP span to convey a status report. Could a telemetry schema be developed for status reports, instead of a dedicated field? Maybe this is a good-in-theory, bad-in-practice suggestion, but may I ask -- what benefit is there in having status reports be a new data type?

@jmacd "Status Reports" as they are defined in OpAMP right now are identity reports plus the state of "applying" what the Server offered the Agent to do (e.g. "I applied that config that you gave me successfully"). The Agent self-describes its identity and the identity values can be used by the Server to tailor the rest of its behavior appropriately. Perhaps the "Status Report" name is not conveying this properly (this was raised before, but I wasn't able to find a better name). We could report this same data in OTLP form, but it would require defining lots of semantic conventions and I don't see the benefit of doing that, it would make things more convoluted and less efficient, not to mention that OTLP does not define WebSocket as a transport, which is essential to OpAMP's certain capabilities.

As opposed to "Status Reports" the "telemetry" part of Agent's state is reported via OTLP/HTTP and OpAMP merely tells the Agent where the OTLP destination is.

Could we use the schema_url aspect of the OTel client configuration as a way to describe the agent's telemetry schema, too?

Absolutely. Nothing prevents that and we should encourage the Agents to include schema_url in their own telemetry.

tigrannajaryan avatar Dec 10 '21 22:12 tigrannajaryan

As per the call today, the downside of a persistent connection is not only a problem with scaling a centralized management system, but can also prevent caching of configurations on things like CDNs. It can also create a single point of failure around configuring dynamic data collection during an outage which affects the monitoring system itself. I still think the UX being instant and responsive outweigh the scaling challenges which this approach could introduce.

jkowall avatar Dec 14 '21 19:12 jkowall

Reopening this.

It was closed automatically by https://github.com/open-telemetry/opamp-spec/issues/50, which makes it possible to use OpAMP for SDKs, but we haven't made any explicit decision to actually do it for any Otel SDK and how exactly it would be done.

I would like to keep this issue open as a place to capture further discussions on this topic.

tigrannajaryan avatar Nov 21 '24 18:11 tigrannajaryan