opentelemetry-ruby icon indicating copy to clipboard operation
opentelemetry-ruby copied to clipboard

feat: SQS create process spans

Open YanivD opened this issue 3 years ago • 10 comments

This pr discussed as part of https://github.com/open-telemetry/opentelemetry-ruby/pull/1026


What's added in this PR?

  • Added config option extract_messaging_context disabled by default.
  • Extracting context propagation for SQS.ReceiveMessage method. In addition to the receive span, it creates an empty process span for every processed message, linking to the producer span.

Context propagation strategy

Currently SQS doesn't natively support propagating the OTEL context. Used JS aws-sdk instrumentation as an inspiration for this PR. It use message attributes to populate propagator fields.

Limitations

SQS support up to 10 message metadata attributes. Context fields will not be injected into the sqs message attributes if it will cause the limit to be reached.

Examples

Produce message span

#<struct OpenTelemetry::SDK::Trace::SpanData
 name="demo-topic send",
 kind=:producer,
 status=
  #<OpenTelemetry::Trace::Status:0x00007f9fd0960920 @code=1, @description="">,
 parent_span_id="\x00\x00\x00\x00\x00\x00\x00\x00",
 total_recorded_attributes=8,
 total_recorded_events=0,
 total_recorded_links=0,
 start_timestamp=1652257484971839000,
 end_timestamp=1652257485734213000,
 attributes=
  {"aws.region"=>"us-east-1",
   "rpc.system"=>"aws-api",
   "rpc.method"=>"Publish",
   "rpc.service"=>"SNS",
   "messaging.system"=>"aws.sns",
   "messaging.destination_kind"=>"topic",
   "messaging.destination"=>"demo-topic",
   "http.status_code"=>200},
 links=nil,
 events=nil,
 resource=
  #<OpenTelemetry::SDK::Resources::Resource:0x00007f9fcd29f620
   @attributes=
    {"service.name"=>"unknown_service",
     "process.pid"=>11398,
     "process.command"=>"trace_demonstration.rb",
     "process.runtime.name"=>"ruby",
     "process.runtime.version"=>"3.0.2",
     "process.runtime.description"=>
      "ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-darwin19]",
     "telemetry.sdk.name"=>"opentelemetry",
     "telemetry.sdk.language"=>"ruby",
     "telemetry.sdk.version"=>"1.0.3"}>,
 instrumentation_library=
  #<struct OpenTelemetry::SDK::InstrumentationLibrary
   name="OpenTelemetry::Instrumentation::AwsSdk",
   version="0.2.3">,
 span_id="\x04\x05E5\xC2i\x01E",
 trace_id="\xF7\xADg\xC1Q\xDB\xE4\x95G\xBF\x01v\x8A\xD2\x0E\xDD",
 trace_flags=#<OpenTelemetry::Trace::TraceFlags:0x00007f9fcd2827c8 @flags=1>,
 tracestate=#<OpenTelemetry::Trace::Tracestate:0x00007f9fcd29f378 @hash={}>>

Consume message spans

#<struct OpenTelemetry::SDK::Trace::SpanData
 name="demo-queue process",
 kind=:consumer,
 status=
  #<OpenTelemetry::Trace::Status:0x00007f79229ec8c8 @code=1, @description="">,
 parent_span_id="\xBE\x19\xE6,\xD9\xF4\xEF\xFF",
 total_recorded_attributes=6,
 total_recorded_events=0,
 total_recorded_links=1,
 start_timestamp=1652257507902723000,
 end_timestamp=1652257507902740000,
 attributes=
  {"messaging.system"=>"aws.sqs",
   "messaging.destination"=>"demo-queue",
   "messaging.destination_kind"=>"queue",
   "messaging.message_id"=>"0a7d851d-c3fd-4149-827d-67f6923a9d95",
   "messaging.url"=>
    "https://sqs.us-east-1.amazonaws.com/731241200035/demo-queue",
   "messaging.operation"=>"process"},
 links=
  [#<OpenTelemetry::Trace::Link:0x00007f7922aa5f30
    @attributes={},
    @span_context=
     #<OpenTelemetry::Trace::SpanContext:0x00007f7922aa61b0
      @remote=true,
      @span_id="\x04\x05E5\xC2i\x01E",
      @trace_flags=
       #<OpenTelemetry::Trace::TraceFlags:0x00007f7922aa6430 @flags=1>,
      @trace_id="\xF7\xADg\xC1Q\xDB\xE4\x95G\xBF\x01v\x8A\xD2\x0E\xDD",
      @tracestate=
       #<OpenTelemetry::Trace::Tracestate:0x00007f792393f190 @hash={}>>>],
 events=nil,
 resource=
  #<OpenTelemetry::SDK::Resources::Resource:0x00007f792393d9f8
   @attributes=
    {"service.name"=>"unknown_service",
     "process.pid"=>11440,
     "process.command"=>"trace_demonstration.rb",
     "process.runtime.name"=>"ruby",
     "process.runtime.version"=>"3.0.2",
     "process.runtime.description"=>
      "ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-darwin19]",
     "telemetry.sdk.name"=>"opentelemetry",
     "telemetry.sdk.language"=>"ruby",
     "telemetry.sdk.version"=>"1.0.3"}>,
 instrumentation_library=
  #<struct OpenTelemetry::SDK::InstrumentationLibrary
   name="OpenTelemetry::Instrumentation::AwsSdk",
   version="0.2.3">,
 span_id="&.(\xB1L\xC6\xB8K",
 trace_id="\x18\xDB\x0F7\x921$\fF\xCA\xD0\xABU`\xBB\x12",
 trace_flags=#<OpenTelemetry::Trace::TraceFlags:0x00007f7923926668 @flags=1>,
 tracestate=#<OpenTelemetry::Trace::Tracestate:0x00007f792393f190 @hash={}>>

#<struct OpenTelemetry::SDK::Trace::SpanData
 name="demo-queue receive",
 kind=:consumer,
 status=
  #<OpenTelemetry::Trace::Status:0x00007f79229ec8c8 @code=1, @description="">,
 parent_span_id="\x00\x00\x00\x00\x00\x00\x00\x00",
 total_recorded_attributes=10,
 total_recorded_events=0,
 total_recorded_links=0,
 start_timestamp=1652257507141648000,
 end_timestamp=1652257507904882000,
 attributes=
  {"aws.region"=>"us-east-1",
   "rpc.system"=>"aws-api",
   "rpc.method"=>"ReceiveMessage",
   "rpc.service"=>"SQS",
   "messaging.system"=>"aws.sqs",
   "messaging.destination_kind"=>"queue",
   "messaging.destination"=>"demo-queue",
   "messaging.url"=>
    "https://sqs.us-east-1.amazonaws.com/731241200035/demo-queue",
   "messaging.operation"=>"receive",
   "http.status_code"=>200},
 links=nil,
 events=nil,
 resource=
  #<OpenTelemetry::SDK::Resources::Resource:0x00007f792393d9f8
   @attributes=
    {"service.name"=>"unknown_service",
     "process.pid"=>11440,
     "process.command"=>"trace_demonstration.rb",
     "process.runtime.name"=>"ruby",
     "process.runtime.version"=>"3.0.2",
     "process.runtime.description"=>
      "ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-darwin19]",
     "telemetry.sdk.name"=>"opentelemetry",
     "telemetry.sdk.language"=>"ruby",
     "telemetry.sdk.version"=>"1.0.3"}>,
 instrumentation_library=
  #<struct OpenTelemetry::SDK::InstrumentationLibrary
   name="OpenTelemetry::Instrumentation::AwsSdk",
   version="0.2.3">,
 span_id="\xBE\x19\xE6,\xD9\xF4\xEF\xFF",
 trace_id="\x18\xDB\x0F7\x921$\fF\xCA\xD0\xABU`\xBB\x12",
 trace_flags=#<OpenTelemetry::Trace::TraceFlags:0x00007f7923926668 @flags=1>,
 tracestate=#<OpenTelemetry::Trace::Tracestate:0x00007f792393f190 @hash={}>>

YanivD avatar May 11 '22 08:05 YanivD

@YanivD I will try to review this week, i am on PTO but have some limited availability. This is great, adding extract_messaging_context is very useful to add. I want to think a bit more about the parent/child+link pattern that's introduced as well as the additional span verbosity of the process spans.

ericmustin avatar May 11 '22 17:05 ericmustin

@YanivD sorry for the long delay here. We talked this over in the SIG yesterday, and were wondering if you'd like to try adding a configuration option that lets users pick how they'd like to propagate span context. So a user could choose to:

  1. have the trace generated by a message consumer be a child span of the producer
  2. have the trace generated by a message consumer be linked to the producer
  3. have no relationship between message consumer and message producer

Something like: option :propagation_style, default: :link, validate: %i[link child none] This is what we're doing in the active_job instrumentation, que, etc.

Any thoughts?

plantfansam avatar May 18 '22 22:05 plantfansam

@YanivD sorry for the long delay here. We talked this over in the SIG yesterday, and were wondering if you'd like to try adding a configuration option that lets users pick how they'd like to propagate span context. So a user could choose to:

  1. have the trace generated by a message consumer be a child span of the producer
  2. have the trace generated by a message consumer be linked to the producer
  3. have no relationship between message consumer and message producer

Something like: option :propagation_style, default: :link, validate: %i[link child none] This is what we're doing in the active_job instrumentation, que, etc.

Any thoughts?

I am against it for the following reasons:

  1. This behavior is not part of the current specification. The specification states that batch receive should store links to producers on process spans
  2. If a user connects process span to producer as a child, then the link to receive is lost, e.g. - there is a trace that only performs receive and then you are not able to tell what it received (which messages). Any batch operation also loses context to the consumed messages.
  3. Sampling is more complex, as the sampling decision is now propagated instead of being set in a consistent manner. It also creates a possibility that received is sampled but some of the process are not and vice versa which can be surprising to users.
  4. The semantic conventions for messaging systems SIG is currently working on a new specification to cover those cases correctly. The instrumentation will be refactored anyhow in the near future and will look completely different. I recommend bringing this issue up in the next messaging SIG meeting "Every Thursday at 8:00 PT" and discussing this use case and what specification should allow as a valid "batch consumer" trace. Even though connecting process to producer is handly, it has other downsides that the community is actively discussing and it's the perfect time to bring these concerns up.

blumamir avatar May 23 '22 11:05 blumamir

I am against it for the following reasons:

This behavior is not part of the current specification. The specification states that batch receive should store links to producers on process spans

We are aware of this. We deviated from the spec because of a combination of our own experience and end user reports. IIRC links are a re-definition of FollowsFrom for OpenTracing, which is unfortunately not widely supported by many of the tracing vendors. That was the motivation of adding the option to select child as an option for propagation on the receive side.

I think we should drop support for none because I do not think there is a use case where this is helpful to our end-users.

If a user connects process span to producer as a child, then the link to receive is lost, e.g. - there is a trace that only performs receive and then you are not able to tell what it received (which messages). Any batch operation also loses context to the consumed messages.

I am not sure I understand what you mean by this. How is the link lost if we specify propagation_style: child then the propagation extract will continue the trace using parent context instead of as a link.

Sampling is more complex, as the sampling decision is now propagated instead of being set in a consistent manner. It also creates a possibility that received is sampled but some of the process are not and vice versa which can be surprising to users.

I am not familiar with the specifics about how this would impact a sampling decision. Is this there somewhere I would be able to read up on this?

The semantic conventions for messaging systems SIG is currently working on a new specification to cover those cases correctly. The instrumentation will be refactored anyhow in the near future and will look completely different. I recommend bringing this issue up in the next messaging SIG meeting "Every Thursday at 8:00 PT" and discussing this use case and what specification should allow as a valid "batch consumer" trace. Even though connecting process to producer is handly, it has other downsides that the community is actively discussing and it's the perfect time to bring these concerns up.

I do not think we have anyone attending that Sig. Is there an async way to cover this?

Anyone know what team we would be able to ping? https://github.com/orgs/open-telemetry/teams?query=

Should we open a discussion in the otel-spec repo?

arielvalentin avatar May 23 '22 14:05 arielvalentin

This PR allow users to have process spans linked to the producer span. In this PR I followed the existing behavior in node instrumentation and the current messaging spec:

Note that one or multiple Spans with messaging.operation = process may often be the children of a Span with messaging.operation = receive

I don't have any objection to support also child propagation, but I don't have the time resources to do it.

IMO this PR is ready for review as it is, it provide value to OTEL users. I'll be more than happy to see the instrumentation supports more propagation styles, once someone will need this feature.

YanivD avatar May 25 '22 08:05 YanivD

Maybe I do not see them. Are there any unit tests that cover this functionality?

Co-sign this - can we add some unit tests? I think this PR is good to review as-is (see thread in CNCF Slack #otel-ruby channel).

If we want to add support for child/none option, then we can do so in a follow-on PR.

For future 👀 there is an OTEP brewing that attempts to bring some clarity to this.

plantfansam avatar May 27 '22 12:05 plantfansam

I won't be able to make the requested changes in the near future. I would really appreciate if someone can assist and implement the changes for this PR.

YanivD avatar Jun 16 '22 07:06 YanivD

Hello, and thank you for your contribution!

We recently split Ruby instrumentation out into the opentelemetry-ruby-contrib repo.

This PR is related to instrumentation, so I am going to re-open it against opentelemetry-ruby-contrib. I will link to that when I have done so.

plantfansam avatar Jun 17 '22 17:06 plantfansam

This PR has been migrated https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/51

plantfansam avatar Jun 17 '22 17:06 plantfansam

I won't be able to make the requested changes in the near future. I would really appreciate if someone can assist and implement the changes for this PR.

Thanks for letting us know @YanivD — hopefully I can finish this one up myself!

plantfansam avatar Jun 17 '22 17:06 plantfansam

Closing this PR what with it have been moved to opentelemetry-ruby-contrib.

robbkidd avatar Dec 07 '23 20:12 robbkidd