Opamp spec overloads definition of service.name
AgentDescription.identifying_attributes says the following about service.name:
service.name should be set to a reverse FQDN that uniquely identifies the Agent type, e.g. "io.opentelemetry.collector"
This definition contradicts the service.name definition in the resource semantic conventions, which define it as "Logical name of the service".
This distinction is important because presumably opamp is using the "Agent type" language to serve as an identifier for the class of agent. I.e. to distinguish between collectors, SDKs, and perhaps other future agent types. This is obviously necessary, but service.name isn't the right attribute.
When it comes time for opamp to be applied to SDKs, it won't be possible to assign an agent type identifier to service.name since SDKs have broadly adopted the logical name definition. On a related note, it doesn't make sense to assign io.opentelemetry.collector to service.name in collectors either, since it impedes the ability to have multiple sets of collectors performing different functions, each with a different logical service.name.
I think we probably need a new resource attribute to accommodate the need for an agent type identifier.
This distinction is important because presumably opamp is using the "Agent type" language to serve as an identifier for the class of agent. I.e. to distinguish between collectors, SDKs, and perhaps other future agent types.
That is not quite what the problem is.
OpAMP doesn't use "agent type" in any special way. Nothing in OpAMP specifically depends on "agent type" or on service.name attribute specifically.
OpAMP says the following:
1. Agents have identifying_attributes. [This is an OpAMP design matter, so it appears good to me]
2. One of the recommended attributes is service.name. [This sounds reasonable in the Otel world]
3. The value of service.name used in OpAMP should be equal to the value of the service.name used to report its own telemetry. [Again reasonable, to make sure we can correlate between OpAMP and own telemetry]
4. The recommended value for service.name is an FQDN. This part of the OpAMP spec is probably the incorrect part. We should not be making any recommendations in OpAMP about what service.name is. It is none of OpAMP's business since it is likely already the Otel spec's business.
I believe 1-3 are reasonable and nothing wrong with those. Number 4 likely needs to be deleted.
I think we probably need a new resource attribute to accommodate the need for an agent type identifier.
OpAMP doesn't really need it at all for now.
Thanks for the explanation!
OpAMP doesn't use "agent type" in any special way. Nothing in OpAMP specifically depends on "agent type" or on service.name attribute specifically.
Apologies if this has an obvious answer (I'm still wrapping my head around OpAMP): How would an OpAMP server differentiate between a client which is a collector vs. an SDK? Perhaps the type of client isn't the concern of the protocol and is instead something the operator of the OpAMP server is expected to know ahead of time for a set of identifying attributes?
How would an OpAMP server differentiate between a client which is a collector vs. an SDK? Perhaps the type of client isn't the concern of the protocol and is instead something the operator of the OpAMP server is expected to know ahead of time for a set of identifying attributes?
Yes, there is no expectation that an OpAMP server implementation will have any hard-coded logic that is based on the "type of the client". The way I envisioned it is that on the server the end user can define configs associated with predicates that run on the identifying (and possibly on non-identifying) attributes, and the config that matches the predicate is returned to the corresponding clients. So, knowing that the Otel Collector uses service.name=otelcol, the user will define a Collector config for clients that match that criterion.
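To make the predicate idea concrete, here is a minimal sketch in Python. It is not taken from any OpAMP server implementation: the rule table, the `attr_equals` helper, the `select_config` function, and the `collector-config.yaml` name are all hypothetical, and agent attributes are simplified to a flat string-to-string mapping.

```python
from typing import Callable, Optional

Attributes = dict[str, str]
Predicate = Callable[[Attributes], bool]


def attr_equals(key: str, value: str) -> Predicate:
    """Build a predicate that matches when attrs[key] equals value."""
    return lambda attrs: attrs.get(key) == value


# Hypothetical user-defined rules: (predicate, config to return).
# The first matching rule wins; this ordering policy is an assumption.
RULES: list[tuple[Predicate, str]] = [
    (attr_equals("service.name", "otelcol"), "collector-config.yaml"),
]


def select_config(attrs: Attributes,
                  rules: list[tuple[Predicate, str]] = RULES) -> Optional[str]:
    """Return the config of the first rule whose predicate matches, else None."""
    for predicate, config in rules:
        if predicate(attrs):
            return config
    return None
```

With this sketch, a client reporting service.name=otelcol gets the Collector config, and a client with any other service.name gets nothing until the user adds a rule for it.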
Perhaps we need something more here. I am open to suggestions.
@andykellr I am also curious what you think.
I mostly agree with you, but I think having a convention for agent type is useful. As more agents implement OpAMP, management servers are going to want to show the users information about the agents that are connected, possibly visualizing them with corresponding icons or linking to documentation specific to those agents. Having an arbitrary format for agent type could potentially lead to duplicate names and FQDN avoids that situation.
Should this agent type be service.name or should we introduce a different attribute like service.type to identify the agent type? I'm not sure.
should we introduce a different attribute like service.type to identify the agent type? I'm not sure.
Probably this. Given that we were not sure what to put in the service.name on the Collector side, this may be a good option to add this to Otel semconv. We can require that it is a reverse FQDN.
I think service.type is conceptually correct, but maybe not the right name since it's not clear that the type is relevant for opamp purposes.
What about something like:
- Resource attribute key is opamp.agent.type or service.agent.type. The key is unambiguous in its purpose for opamp and not overloaded with multiple uses.
- Possible values are known types of clients that can be configured by the opamp protocol. Currently that would just be collector, but once SDKs are configurable via opamp, we would also add sdk. If someone uses the opamp protocol to remotely manage other agent types, they can specify their own custom value.
- Value type is an array of strings, since a particular agent might be configurable as multiple agent types. For example, a collector could have its collector config configured, but the collector will also eventually have the go sdk installed in it, which would be separately configurable with opamp. If an opamp client sends multiple opamp.agent.type values up to a server, the server must choose which type its responses are applicable for.
Resource attribute key is opamp.agent.type or service.agent.type. The key is unambiguous in its purpose for opamp and not overloaded with multiple uses.
@jack-berg I like the idea of semantic conventions that are specific for OpAMP usage. We still want service.name to be included in the identifying_attributes for the purpose of correlation with own telemetry.
However, nothing prevents us from including additional (non?)identifying attributes in the OpAMP protocol itself, which are defined as semantic conventions that are specific to OpAMP. We can define a number of attributes which can then be used for fine or coarse classification on the server side, e.g.:
service.name=otelcol
service.version=0.40.0
service.instance.id=<some uuid here>
opamp.agent.type=io.opentelemetry.collector
opamp.agent.distro=github.com/signalfx/splunk-otel-collector
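As a rough illustration of coarse versus fine classification, a server could read the two proposed attributes like this. This is only a sketch in Python: opamp.agent.type and opamp.agent.distro are the attribute names proposed in this thread, not established conventions, and the `classify` function and the "unknown" fallback are hypothetical.

```python
def classify(attrs: dict[str, str]) -> tuple[str, str]:
    """Return (coarse, fine) labels for an agent.

    coarse: the proposed opamp.agent.type, e.g. all collectors
    fine:   the proposed opamp.agent.distro, e.g. one specific distribution
    Missing attributes fall back to "unknown" (an assumption of this sketch).
    """
    coarse = attrs.get("opamp.agent.type", "unknown")
    fine = attrs.get("opamp.agent.distro", "unknown")
    return coarse, fine


# The attribute set from the example above, minus the instance id.
example = {
    "service.name": "otelcol",
    "service.version": "0.40.0",
    "opamp.agent.type": "io.opentelemetry.collector",
    "opamp.agent.distro": "github.com/signalfx/splunk-otel-collector",
}
```

Note that service.name is not consulted at all here, so it stays free to carry the logical name used for correlation with own telemetry.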
Value type is an array of strings, since a particular agent might be configurable as multiple agent types. For example, a collector could have its collector config configured, but the collector will also eventually have the go sdk installed in it, which would be separately configurable with opamp. If an opamp client sends multiple opamp.agent.type values up to a server, the server must choose which type its responses are applicable for.
I wouldn't want to do this since it creates lots of addressability problems.
Instead, for this use-case the OpAMP client must simply represent 2 different agents: one for the collector, one for the go sdk. The protocol allows this, you can have multiple agents' data transported over one OpAMP connection.
@tigrannajaryan we discussed this issue during the SIG today. I'd like to go ahead and suggest we make a change in the spec to add the attributes that were already mentioned here as part of the standard identifying attributes for an agent (opamp.agent.type, opamp.agent.distro): this would help distinguish between different types of agent, and, unlike service attributes, would also be applied when agents are not standalone (sdk / language agents).
I think it would make sense to also have standard resource attributes for agents, that would be part of the telemetry.
Another point that was discussed is the agent uid, and how it relates to the lifecycle of the agent. While this is not mentioned in the spec, it looks like folks implementing this protocol try to persist the uid so that it remains unchanged after the agent is restarted. So it looks like we're missing a concept to identify agents that is more stable than a process. Perhaps this concept is the uid - then perhaps the spec could make it clearer and suggest that the uid be stable across restarts.
cc @andykellr @portertech
I'd like to go ahead and suggest we make a change in the spec to add the attributes that were already mentioned here as part of the standard identifying attributes for an agent (opamp.agent.type, opamp.agent.distro)
I agree, with one important difference: I believe these are non-identifying attributes. Identifying attributes are defined as attributes that are necessary for unique identification of the agent and are included in own metrics of the agent. We should not add arbitrary descriptive attributes to this list just because they are useful. The non-identifying attributes list can be arbitrarily long and has no such restrictions.
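The distinction can be sketched as follows, in Python. This is a simplified stand-in for the AgentDescription message, using plain dicts rather than the protocol's key-value lists; the `identity` method is hypothetical and only illustrates that identity is derived from the identifying attributes alone.

```python
from dataclasses import dataclass, field


@dataclass
class AgentDescription:
    # Simplified stand-in: the real message carries lists of key-value pairs.
    identifying_attributes: dict[str, str] = field(default_factory=dict)
    non_identifying_attributes: dict[str, str] = field(default_factory=dict)

    def identity(self) -> tuple[tuple[str, str], ...]:
        """Hypothetical identity key built only from identifying attributes."""
        return tuple(sorted(self.identifying_attributes.items()))


agent = AgentDescription(
    identifying_attributes={
        "service.name": "otelcol",
        "service.instance.id": "<some uuid here>",
    },
    # Descriptive attributes go here; this list can grow without
    # changing the agent's identity.
    non_identifying_attributes={
        "opamp.agent.type": "io.opentelemetry.collector",
    },
)
```

Under this sketch, adding or changing opamp.agent.type or opamp.agent.distro never changes which agent the server thinks it is talking to.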
I am not sure where exactly we want to define these semantic conventions. It can be in OpAMP spec here in this repo or it can be in Otel's semantic conventions list.
Another point that was discussed is the agent uid, and how it relates to the lifecycle of the agent. While this is not mentioned in the spec, it looks like folks implementing this protocol try to persist the uid so that it remains unchanged after the agent is restarted. So it looks like we're missing a concept to identify agents that is more stable than a process. Perhaps this concept is the uid - then perhaps the spec could make it clearer and suggest that the uid be stable across restarts.
Persisting the instance id is useful but I think it should not be mandatory. We can add a recommendation that when possible the uid should be persistent. In some environments it may not be possible and I think it is OK if it is ephemeral.
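A "persist when possible, otherwise ephemeral" policy could look roughly like this on the agent side. This is a sketch in Python, not code from any OpAMP client: the `load_or_create_uid` function and the file-based storage are hypothetical, and the uuid4 string is used only for illustration, without regard to the uid format the spec requires.

```python
import os
import uuid


def load_or_create_uid(path: str) -> str:
    """Return the uid stored at path if readable, else create a new one.

    The new uid is written back when storage is writable; if it is not,
    the uid is simply ephemeral and a fresh one is generated on next start.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            existing = f.read().strip()
        if existing:
            return existing
    except OSError:
        pass  # no stored uid yet, or storage is unreadable

    new_uid = str(uuid.uuid4())  # illustrative format only (assumption)
    try:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(new_uid)
    except OSError:
        pass  # read-only or missing storage: fall back to an ephemeral uid
    return new_uid
```

Calling this at startup with a path on persistent storage yields the same uid across restarts, while environments without writable storage still get a working, ephemeral uid.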
Submitted this issue to discuss in semconv: https://github.com/open-telemetry/semantic-conventions/issues/554
All, the PR that adds service.type is created, but I and others have doubts that this is the right way. Please comment on the PR with arguments in favour or against it.