Introduction of a Synthetic Attribute for Server Span Telemetry
Area(s)
area:browser
Is your change request related to a problem? Please describe.
I would like to be able to identify telemetry created by synthetic sources such as bots or crawlers. This issue proposes defining conventions for marking spans as originating from a synthetic source.
Describe the solution you'd like
I would like to introduce an attribute to the HTTP server span semantic conventions, as well as to metrics and logs, that holds a low-cardinality string such as the following:
synthetic -> "not set" | "bot" | "synthetic test"
Here, setting the synthetic attribute to "not set" indicates telemetry that was not generated by a synthetic source. This convention would be helpful in scenarios where a user wants to separate telemetry generated by frequent synthetic tests or web crawlers from direct human engagement.
The determination of which of the three options a span falls into could be made by maintaining a list of known synthetic sources or by making the decision user-configurable.
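To make the idea concrete, here is a minimal sketch of how such a determination might work, assuming the decision is based on substring matches against the User-Agent string. The function name classify_synthetic, the DEFAULT_SYNTHETIC_SOURCES table, and the example-synthetic-monitor entry are all hypothetical and only for illustration; nothing here is part of an existing convention or SDK.

```python
from typing import Mapping, Optional

# Hypothetical, illustrative table of User-Agent substrings; not an
# authoritative list of synthetic sources.
DEFAULT_SYNTHETIC_SOURCES = {
    "googlebot": "bot",
    "bingbot": "bot",
    "example-synthetic-monitor": "synthetic test",  # made-up monitor name
}


def classify_synthetic(
    user_agent: Optional[str],
    overrides: Optional[Mapping[str, str]] = None,
) -> str:
    """Return "bot", "synthetic test", or "not set" for a User-Agent value.

    `overrides` stands in for the user-configurable part of the proposal:
    entries are merged over the built-in table.
    """
    sources = {**DEFAULT_SYNTHETIC_SOURCES, **(overrides or {})}
    if user_agent:
        lowered = user_agent.lower()
        for needle, kind in sources.items():
            # Case-insensitive substring match against the User-Agent.
            if needle.lower() in lowered:
                return kind
    return "not set"
```

A user-supplied mapping such as {"AcmeProbe": "synthetic test"} would extend the built-in table without changing it, which covers both options mentioned above.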
Describe alternatives you've considered
While we could consider setting the synthetic attribute to a Boolean value, I believe the extra granularity of the low-cardinality string would be valuable.
Additional context
No response
https://github.com/open-telemetry/semantic-conventions/issues/1230
A few questions/thoughts:
- While non-HTTP usage would probably be low or non-existent, I don't think it belongs in the HTTP domain. So I'd consider adding some attribute like user_agent.type (probably needs a better name).
- Is there some prior art in the industry to identify a synthetic source/bot user? Is there an attribute in ECS for it? Are there some non-telemetry user-agent conventions for it?
- nit: let's just not set an attribute instead of using a not_set value.
It'd be awesome if you could send a PR with a specific proposal (considering the above).
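As a rough illustration of the last point, the sketch below builds a span attribute dictionary and omits the classification key entirely when the request is not known to be synthetic. The helper name build_span_attributes is hypothetical, and user_agent.type is only the placeholder name floated in this comment, not an agreed attribute; user_agent.original is the existing attribute for the raw User-Agent value.

```python
from typing import Dict, Optional


def build_span_attributes(
    user_agent: str, synthetic_type: Optional[str]
) -> Dict[str, str]:
    """Build span attributes, leaving the type unset for non-synthetic traffic."""
    attributes = {"user_agent.original": user_agent}
    if synthetic_type is not None:
        # Placeholder attribute name from the discussion; may change.
        attributes["user_agent.type"] = synthetic_type
    return attributes
```

With this shape, backends can filter on the presence of the key rather than comparing against a sentinel string.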
Thank you for your feedback on this issue! Just a couple questions regarding your first point:
- While I don't expect non-HTTP telemetry to need this synthetic source flag, I suppose it could be more generic and defined outside of HTTP specifically. However, I'm struggling to find any more relevant association for this. For example, if I want to define some attribute in spans.yaml, I have the options of http, rpc, faas, gen-ai, database, messaging, and cloud-events, none of which seem more relevant than http for something like synthetic source. Maybe I'm missing something about the structure of the semantic conventions here.
- I'm also curious about the idea for a user_agent.type field. What kind of data would a field with that name hold?
I think we need to get some clarity regarding what a "synthetic source" is. For example, do we think it'll be a static list of client types (e.g. the User-Agent header for HTTP) or a list that will be frequently updated?
For example, we do not want to have an explicit flag saying "this trace is a result of a synthetic request" and then later notice "oops, we just realized that there are other traces from agent XYZ, and this agent is actually powered by AI/LLM, so the previously added synthetic flag should be fixed".
@reyang I think it'll be important to keep the list of known synthetic sources updated over time as there's no way to predict how popular a certain bot might become.
I'm a little confused by your example. Are you essentially saying that in the scenario, it would be possible that we would miss synthetic traces created by newer technologies if we only maintained a static list?