opentelemetry-ruby
                                
                                 opentelemetry-ruby copied to clipboard
                                
                                    opentelemetry-ruby copied to clipboard
                            
                            
                            
                        feat: SQS create process spans
This pr discussed as part of https://github.com/open-telemetry/opentelemetry-ruby/pull/1026
What's added in this PR?
- Added config option extract_messaging_contextdisabled by default.
- Extracting context propagation for SQS.ReceiveMessagemethod. In addition to thereceivespan, it creates an emptyprocessspan for every processed message, linking to the producer span.
Context propagation strategy
Currently SQS doesn't natively support propagating the OTEL context. Used JS aws-sdk instrumentation as an inspiration for this PR. It use message attributes to populate propagator fields.
Limitations
SQS support up to 10 message metadata attributes. Context fields will not be injected into the sqs message attributes if it will cause the limit to be reached.
Examples
Produce message span
#<struct OpenTelemetry::SDK::Trace::SpanData
 name="demo-topic send",
 kind=:producer,
 status=
  #<OpenTelemetry::Trace::Status:0x00007f9fd0960920 @code=1, @description="">,
 parent_span_id="\x00\x00\x00\x00\x00\x00\x00\x00",
 total_recorded_attributes=8,
 total_recorded_events=0,
 total_recorded_links=0,
 start_timestamp=1652257484971839000,
 end_timestamp=1652257485734213000,
 attributes=
  {"aws.region"=>"us-east-1",
   "rpc.system"=>"aws-api",
   "rpc.method"=>"Publish",
   "rpc.service"=>"SNS",
   "messaging.system"=>"aws.sns",
   "messaging.destination_kind"=>"topic",
   "messaging.destination"=>"demo-topic",
   "http.status_code"=>200},
 links=nil,
 events=nil,
 resource=
  #<OpenTelemetry::SDK::Resources::Resource:0x00007f9fcd29f620
   @attributes=
    {"service.name"=>"unknown_service",
     "process.pid"=>11398,
     "process.command"=>"trace_demonstration.rb",
     "process.runtime.name"=>"ruby",
     "process.runtime.version"=>"3.0.2",
     "process.runtime.description"=>
      "ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-darwin19]",
     "telemetry.sdk.name"=>"opentelemetry",
     "telemetry.sdk.language"=>"ruby",
     "telemetry.sdk.version"=>"1.0.3"}>,
 instrumentation_library=
  #<struct OpenTelemetry::SDK::InstrumentationLibrary
   name="OpenTelemetry::Instrumentation::AwsSdk",
   version="0.2.3">,
 span_id="\x04\x05E5\xC2i\x01E",
 trace_id="\xF7\xADg\xC1Q\xDB\xE4\x95G\xBF\x01v\x8A\xD2\x0E\xDD",
 trace_flags=#<OpenTelemetry::Trace::TraceFlags:0x00007f9fcd2827c8 @flags=1>,
 tracestate=#<OpenTelemetry::Trace::Tracestate:0x00007f9fcd29f378 @hash={}>>
Consume message spans
#<struct OpenTelemetry::SDK::Trace::SpanData
 name="demo-queue process",
 kind=:consumer,
 status=
  #<OpenTelemetry::Trace::Status:0x00007f79229ec8c8 @code=1, @description="">,
 parent_span_id="\xBE\x19\xE6,\xD9\xF4\xEF\xFF",
 total_recorded_attributes=6,
 total_recorded_events=0,
 total_recorded_links=1,
 start_timestamp=1652257507902723000,
 end_timestamp=1652257507902740000,
 attributes=
  {"messaging.system"=>"aws.sqs",
   "messaging.destination"=>"demo-queue",
   "messaging.destination_kind"=>"queue",
   "messaging.message_id"=>"0a7d851d-c3fd-4149-827d-67f6923a9d95",
   "messaging.url"=>
    "https://sqs.us-east-1.amazonaws.com/731241200035/demo-queue",
   "messaging.operation"=>"process"},
 links=
  [#<OpenTelemetry::Trace::Link:0x00007f7922aa5f30
    @attributes={},
    @span_context=
     #<OpenTelemetry::Trace::SpanContext:0x00007f7922aa61b0
      @remote=true,
      @span_id="\x04\x05E5\xC2i\x01E",
      @trace_flags=
       #<OpenTelemetry::Trace::TraceFlags:0x00007f7922aa6430 @flags=1>,
      @trace_id="\xF7\xADg\xC1Q\xDB\xE4\x95G\xBF\x01v\x8A\xD2\x0E\xDD",
      @tracestate=
       #<OpenTelemetry::Trace::Tracestate:0x00007f792393f190 @hash={}>>>],
 events=nil,
 resource=
  #<OpenTelemetry::SDK::Resources::Resource:0x00007f792393d9f8
   @attributes=
    {"service.name"=>"unknown_service",
     "process.pid"=>11440,
     "process.command"=>"trace_demonstration.rb",
     "process.runtime.name"=>"ruby",
     "process.runtime.version"=>"3.0.2",
     "process.runtime.description"=>
      "ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-darwin19]",
     "telemetry.sdk.name"=>"opentelemetry",
     "telemetry.sdk.language"=>"ruby",
     "telemetry.sdk.version"=>"1.0.3"}>,
 instrumentation_library=
  #<struct OpenTelemetry::SDK::InstrumentationLibrary
   name="OpenTelemetry::Instrumentation::AwsSdk",
   version="0.2.3">,
 span_id="&.(\xB1L\xC6\xB8K",
 trace_id="\x18\xDB\x0F7\x921$\fF\xCA\xD0\xABU`\xBB\x12",
 trace_flags=#<OpenTelemetry::Trace::TraceFlags:0x00007f7923926668 @flags=1>,
 tracestate=#<OpenTelemetry::Trace::Tracestate:0x00007f792393f190 @hash={}>>
#<struct OpenTelemetry::SDK::Trace::SpanData
 name="demo-queue receive",
 kind=:consumer,
 status=
  #<OpenTelemetry::Trace::Status:0x00007f79229ec8c8 @code=1, @description="">,
 parent_span_id="\x00\x00\x00\x00\x00\x00\x00\x00",
 total_recorded_attributes=10,
 total_recorded_events=0,
 total_recorded_links=0,
 start_timestamp=1652257507141648000,
 end_timestamp=1652257507904882000,
 attributes=
  {"aws.region"=>"us-east-1",
   "rpc.system"=>"aws-api",
   "rpc.method"=>"ReceiveMessage",
   "rpc.service"=>"SQS",
   "messaging.system"=>"aws.sqs",
   "messaging.destination_kind"=>"queue",
   "messaging.destination"=>"demo-queue",
   "messaging.url"=>
    "https://sqs.us-east-1.amazonaws.com/731241200035/demo-queue",
   "messaging.operation"=>"receive",
   "http.status_code"=>200},
 links=nil,
 events=nil,
 resource=
  #<OpenTelemetry::SDK::Resources::Resource:0x00007f792393d9f8
   @attributes=
    {"service.name"=>"unknown_service",
     "process.pid"=>11440,
     "process.command"=>"trace_demonstration.rb",
     "process.runtime.name"=>"ruby",
     "process.runtime.version"=>"3.0.2",
     "process.runtime.description"=>
      "ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-darwin19]",
     "telemetry.sdk.name"=>"opentelemetry",
     "telemetry.sdk.language"=>"ruby",
     "telemetry.sdk.version"=>"1.0.3"}>,
 instrumentation_library=
  #<struct OpenTelemetry::SDK::InstrumentationLibrary
   name="OpenTelemetry::Instrumentation::AwsSdk",
   version="0.2.3">,
 span_id="\xBE\x19\xE6,\xD9\xF4\xEF\xFF",
 trace_id="\x18\xDB\x0F7\x921$\fF\xCA\xD0\xABU`\xBB\x12",
 trace_flags=#<OpenTelemetry::Trace::TraceFlags:0x00007f7923926668 @flags=1>,
 tracestate=#<OpenTelemetry::Trace::Tracestate:0x00007f792393f190 @hash={}>>
@YanivD I will try to review this week, i am on PTO but have some limited availability. This is great, adding extract_messaging_context is very useful to add. I want to think a bit more about the parent/child+link pattern that's introduced as well as the additional span verbosity of the process spans.
@YanivD sorry for the long delay here. We talked this over in the SIG yesterday, and were wondering if you'd like to try adding a configuration option that lets users pick how they'd like to propagate span context. So a user could choose to:
- have the trace generated by a message consumer be a child span of the producer
- have the trace generated by a message consumer be linked to the producer
- have no relationship between message consumer and message producer
Something like: option :propagation_style, default: :link, validate: %i[link child none] This is what we're doing in the active_job instrumentation, que, etc.
Any thoughts?
@YanivD sorry for the long delay here. We talked this over in the SIG yesterday, and were wondering if you'd like to try adding a configuration option that lets users pick how they'd like to propagate span context. So a user could choose to:
- have the trace generated by a message consumer be a child span of the producer
- have the trace generated by a message consumer be linked to the producer
- have no relationship between message consumer and message producer
Something like:
option :propagation_style, default: :link, validate: %i[link child none]This is what we're doing in theactive_jobinstrumentation,que, etc.Any thoughts?
I am against it for the following reasons:
- This behavior is not part of the current specification. The specification states that batch receive should store links to producers on processspans
- If a user connects processspan toproduceras a child, then the link toreceiveis lost, e.g. - there is a trace that only performsreceiveand then you are not able to tell what it received (which messages). Any batch operation also loses context to the consumed messages.
- Sampling is more complex, as the sampling decision is now propagated instead of being set in a consistent manner. It also creates a possibility that received is sampled but some of the process are not and vice versa which can be surprising to users.
- The semantic conventions for messaging systems SIG is currently working on a new specification to cover those cases correctly. The instrumentation will be refactored anyhow in the near future and will look completely different. I recommend bringing this issue up in the next messaging SIG meeting "Every Thursday at 8:00 PT" and discussing this use case and what specification should allow as a valid "batch consumer" trace. Even though connecting processtoproduceris handly, it has other downsides that the community is actively discussing and it's the perfect time to bring these concerns up.
I am against it for the following reasons:
This behavior is not part of the current specification. The specification states that batch receive should store links to producers on process spans
We are aware of this. We deviated from the spec because of a combination of our own experience and end user reports. IIRC links are a re-definition of FollowsFrom for OpenTracing, which is unfortunately not widely supported by many of the tracing vendors. That was the motivation of adding the option to select child as an option for propagation on the receive side.
I think we should drop support for none because I do not think there is a use case where this is helpful to our end-users.
If a user connects process span to producer as a child, then the link to receive is lost, e.g. - there is a trace that only performs receive and then you are not able to tell what it received (which messages). Any batch operation also loses context to the consumed messages.
I am not sure I understand what you mean by this. How is the link lost if we specify propagation_style: child then the propagation extract will continue the trace using parent context instead of as a link.
Sampling is more complex, as the sampling decision is now propagated instead of being set in a consistent manner. It also creates a possibility that received is sampled but some of the process are not and vice versa which can be surprising to users.
I am not familiar with the specifics about how this would impact a sampling decision. Is this there somewhere I would be able to read up on this?
The semantic conventions for messaging systems SIG is currently working on a new specification to cover those cases correctly. The instrumentation will be refactored anyhow in the near future and will look completely different. I recommend bringing this issue up in the next messaging SIG meeting "Every Thursday at 8:00 PT" and discussing this use case and what specification should allow as a valid "batch consumer" trace. Even though connecting process to producer is handly, it has other downsides that the community is actively discussing and it's the perfect time to bring these concerns up.
I do not think we have anyone attending that Sig. Is there an async way to cover this?
Anyone know what team we would be able to ping? https://github.com/orgs/open-telemetry/teams?query=
Should we open a discussion in the otel-spec repo?
This PR allow users to have process spans linked to the producer span. In this PR I followed the existing behavior in node instrumentation and the current messaging spec:
Note that one or multiple Spans with messaging.operation = process may often be the children of a Span with messaging.operation = receive
I don't have any objection to support also child propagation, but I don't have the time resources to do it.
IMO this PR is ready for review as it is, it provide value to OTEL users. I'll be more than happy to see the instrumentation supports more propagation styles, once someone will need this feature.
Maybe I do not see them. Are there any unit tests that cover this functionality?
Co-sign this - can we add some unit tests? I think this PR is good to review as-is (see thread in CNCF Slack #otel-ruby channel).
If we want to add support for child/none option, then we can do so in a follow-on PR.
For future 👀 there is an OTEP brewing that attempts to bring some clarity to this.
I won't be able to make the requested changes in the near future. I would really appreciate if someone can assist and implement the changes for this PR.
Hello, and thank you for your contribution!
We recently split Ruby instrumentation out into the opentelemetry-ruby-contrib repo.
This PR is related to instrumentation, so I am going to re-open it against opentelemetry-ruby-contrib. I will link to that when I have done so.
This PR has been migrated https://github.com/open-telemetry/opentelemetry-ruby-contrib/pull/51
I won't be able to make the requested changes in the near future. I would really appreciate if someone can assist and implement the changes for this PR.
Thanks for letting us know @YanivD — hopefully I can finish this one up myself!
Closing this PR what with it have been moved to opentelemetry-ruby-contrib.