Introduction of a Synthetic Attribute for Server Span Telemetry

Open JacksonWeber opened this issue 1 year ago • 4 comments

Area(s)

area:browser

Is your change request related to a problem? Please describe.

I would like to be able to identify telemetry created by synthetic sources such as bots or crawlers. This issue aims to define conventions for marking spans as originating from a synthetic source.

Describe the solution you'd like

I would like to introduce an attribute to the HTTP server span semantic conventions, as well as to metrics and logs, that holds a low-cardinality string such as the following:

synthetic -> "not set" | "bot" | "synthetic test"

Here, setting the synthetic attribute to "not set" represents telemetry that is not generated from a synthetic source. This convention will be helpful for scenarios where a user may want to mark telemetry generated from frequent synthetic tests or web crawlers separately from direct human engagement.

The determination of which of the three options a span falls into could be made by maintaining a list of known synthetic sources or allowing this decision to be user configurable.
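
As a rough illustration of the approach described above, here is a minimal Python sketch that combines a built-in list of known sources with user-supplied entries. Everything in it is an assumption for discussion purposes: the function name `classify_synthetic`, the `DEFAULT_SOURCES` table, and the `example-uptime-check` token are hypothetical, and matching on the User-Agent string is just one possible signal.

```python
from typing import Mapping, Optional

# Illustrative defaults only; a real list would need ongoing maintenance.
DEFAULT_SOURCES = {
    "googlebot": "bot",
    "bingbot": "bot",
    "example-uptime-check": "synthetic test",  # hypothetical monitoring agent
}


def classify_synthetic(
    user_agent: Optional[str],
    user_sources: Optional[Mapping[str, str]] = None,
) -> str:
    """Return "bot", "synthetic test", or "not set" for a User-Agent value."""
    if not user_agent:
        return "not set"
    ua = user_agent.lower()
    # User-configured entries are checked first so they can override defaults.
    for sources in (user_sources or {}, DEFAULT_SOURCES):
        for token, category in sources.items():
            if token.lower() in ua:
                return category
    return "not set"


print(classify_synthetic("Mozilla/5.0 (compatible; Googlebot/2.1)"))
print(classify_synthetic("example-uptime-check/1.0"))
print(classify_synthetic("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

Checking user-configured entries before the defaults lets a user reclassify or add a source without waiting for the shared list to change.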

Describe alternatives you've considered

While we could consider setting the synthetic attribute to a Boolean value, I believe the extra granularity of the low-cardinality string would be valuable.

Additional context

No response

JacksonWeber avatar Jun 06 '24 01:06 JacksonWeber

https://github.com/open-telemetry/semantic-conventions/issues/1230

MSNev avatar Jun 06 '24 19:06 MSNev

A few questions/thoughts:

  • While non-HTTP usage would probably be low/non-existent, I don't think it belongs in the HTTP domain. So I'd consider adding some attribute like user_agent.type (probably needs a better name).
  • Is there some prior art in the industry to identify a synthetic source/bot user? Is there an attribute in ECS for it? Are there some non-telemetry user-agent conventions for it?
  • nit: let's just not set an attribute instead of using not_set value.

It'd be awesome if you could send a PR with a specific proposal (considering the above).
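
To make the third point concrete, here is a small Python sketch of leaving the attribute off entirely instead of emitting a placeholder value. The helper name `synthetic_attributes` is hypothetical, and `user_agent.type` is only the placeholder name floated in the first point, not an agreed convention.

```python
from typing import Dict, Optional


def synthetic_attributes(source_type: Optional[str]) -> Dict[str, str]:
    """Build span attributes, leaving the key out for non-synthetic traffic."""
    if source_type is None:
        # No placeholder such as "not_set": absence of the key carries the meaning.
        return {}
    return {"user_agent.type": source_type}


# Merge into whatever attributes the span already has.
human = {"http.request.method": "GET", **synthetic_attributes(None)}
crawler = {"http.request.method": "GET", **synthetic_attributes("bot")}
print(human)
print(crawler)
```

With this shape, a backend can find synthetic traffic by checking whether the key exists, and ordinary requests carry no extra attribute.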

lmolkova avatar Oct 07 '24 16:10 lmolkova

Thank you for your feedback on this issue! Just a couple questions regarding your first point:

  • While I don't expect non-HTTP telemetry to need this synthetic source flag, I suppose it could be more generic and defined outside of HTTP specifically. However, I'm struggling to find any more relevant association for this. For example, if I want to define some attribute in spans.yaml, I have the options of http, rpc, faas, gen-ai, database, messaging, and cloud-events, none of which seem to be more relevant than http for something like synthetic source. Maybe I'm missing something about the structure of the semantic conventions here.
  • I'm also curious about the idea for a user_agent.type field, what kind of data would a field with that name hold?

JacksonWeber avatar Oct 09 '24 18:10 JacksonWeber

I think we need to get some clarity regarding what a "synthetic source" is. For example, do we think it'll be a static list of client types (e.g. the User-Agent header for HTTP) or a list that will be frequently updated?

For example, we do not want to add an explicit flag saying "this trace is a result of a synthetic request" and then later realize "oops, there are other traces from agent XYZ, and this agent is actually powered by AI/LLM, so the previously added synthetic flag should be fixed".

reyang avatar Oct 10 '24 16:10 reyang

@reyang I think it'll be important to keep the list of known synthetic sources updated over time as there's no way to predict how popular a certain bot might become.

I'm a little confused by your example. Are you essentially saying that, in that scenario, we could miss synthetic traces created by newer technologies if we only maintained a static list?

JacksonWeber avatar Oct 10 '24 20:10 JacksonWeber