opentelemetry-ruby-contrib icon indicating copy to clipboard operation
opentelemetry-ruby-contrib copied to clipboard

GraphQL dataloaders not compatible with opentelemetry tracing plugin

Open stevecrozz opened this issue 3 years ago • 4 comments
trafficstars

Ruby 2.7.3p183 (2021-04-05 revision 6847ee089d) [x86_64-darwin20] GraphQL Gem Version: 2.0.13

Options enable_platform_field and enable_platform_authorized are not compatible with GraphQL gem's own fiber based dataloaders. Turning on these options while using the built-in dataloaders results in missing data and logs like:

ERROR -- : OpenTelemetry error: calls to detach should match corresponding calls to attach.
ERROR -- : OpenTelemetry error: calls to detach should match corresponding calls to attach.
ERROR -- : OpenTelemetry error: calls to detach should match corresponding calls to attach.
ERROR -- : OpenTelemetry error: calls to detach should match corresponding calls to attach.
ERROR -- : OpenTelemetry error: calls to detach should match corresponding calls to attach.
ERROR -- : OpenTelemetry error: calls to detach should match corresponding calls to attach.

I'm assuming this has to do with the way these data loaders work. Somehow, this appeared to be working fine while using NewRelic tracing and it was able to capture traces with details on field hops and authorizations. Removing the dataloaders fixes this problem, but we need them to prevent N+1 query issues.

# field method
def team
  dataloader.with(Sources::Record, Team).load(object.team_id)
end

# authorized? method
def self.authorized?(object, context)
  can_access_team?(context.dataloader.with(Sources::Record, Team).load(object.team_id))
end

stevecrozz avatar Sep 16 '22 22:09 stevecrozz

Nice call out that NewRelic's works. Seeing similar logs on our end - though it seems to work a decent amount of the time? Haven't nailed down where/when we miss things due to this issue. Did a little preliminary looking:

  1. Here's the graphql tracing base-class https://github.com/rmosolgo/graphql-ruby/blob/master/lib/graphql/tracing/platform_tracing.rb
  2. here's NewRelic's graphql tracer: https://github.com/rmosolgo/graphql-ruby/blob/master/lib/graphql/tracing/new_relic_tracing.rb
  3. here's OTEL's: https://github.com/open-telemetry/opentelemetry-ruby-contrib/blob/main/instrumentation/graphql/lib/opentelemetry/instrumentation/graphql/tracers/graphql_tracer.rb

There are some decisions around transaction-name that seem like it could be a red herring? Unsure. Seems more likely that the way their tracing works is slightly different from how OTEL's span implementation works. For more reference, here's NR tracer vs OTEL tracing in Ruby:

  1. OTEL: https://opentelemetry.io/docs/instrumentation/ruby/manual/#creating-new-spans
  2. New Relic: https://github.com/newrelic/newrelic-ruby-agent/blob/fb7e49b5a89c58cf3ea074a2b6ee7c9922e316b9/lib/new_relic/agent/method_tracer.rb#L71

jphenow avatar Sep 20 '22 16:09 jphenow

I've done a bit of investigation into this and its maybe less than perfectly clear how this should work. Take an overly simplified GraphQL query with following structure, mid-execution:

- Teams (resolved)
  - Team1 (resolved)
    - Players
      - Player1 (paused dataloader fiber, pushed to the span stack, not yet popped)
      - Player2 (paused dataloader fiber, pushed to the span stack, not yet popped)
  - Team2 (resolved)
    - Players
      - Player3 (paused dataloader fiber, pushed to the span stack, not yet popped)
      - Player4 (running fiber)

This above is the picture of what I think is happening. The fibers eventually unpause but the generated span structure is all wrong and by the time we unpause the fibers, the first one runs and it wants to pop the wrong span because so many other fibers have added to the span stack.

Just thinking out loud here: What I'd like have happen for the above example is somehow report time spent fetching the players for a team. NewRelic reports this as 'GraphQL/Team/Player' (not a specific team or player) and I believe provides aggregate info only (sum).

What we probably don't want is to push a new span onto the stack for every single Team/Player combination and I think that's what enable_platform_field is doing. So instead of pushing and popping 4 spans in the above example, I'd like instead to push a single Team/Player span within the overall graphql execution context. I'm an OTEL beginner, but it seems like too much to report every single graph node hop as a span. There could be thousands and thousands of them and they could be deeply nested.

Maybe @rmosolgo can correct me where I'm wrong or provide some ideas on how we could better gain observability via OTEL.

stevecrozz avatar Sep 20 '22 17:09 stevecrozz

One thing worth calling out at the moment: We're having this issue and using dataloaders but have not gotten to tinker with our fiber scheduling at all for this.

What are you using for scheduling?

jphenow avatar Sep 20 '22 17:09 jphenow

We're on Ruby 2.7 so I believe we are not able to use fiber schedulers.

stevecrozz avatar Sep 21 '22 18:09 stevecrozz

@arielvalentin and I just poked around at this for a bit in the Ruby SIG (some notes). We at least have a hypothesis about what's going on (caveat: my knowledge of Fibers is pretty weak, so I might be way off-base here).

Some background: the Ruby OpenTelemetry API uses a Fiber-local array to keep track of the current and previous OpenTelemetry Contexts. When you attach a context, you receive a token that should be used when detaching from the context — this token is just the size of the Fiber-local context array after a new context has been pushed to it. If you try to detach a context with a token that's a different size than the current, Fiber-local stack, you see errors like those posted above.

Now to the ruby-graphql gem: it looks like when it spawns fibers, it copies references to all of the spawning fiber's Fiber-local variables onto the spawned fiber. I think this means that each spawned fiber will have a pointer to the same context stack (rather than a duplicate of the context stack). If that's true, then multiple, concurrently-executing fibers from the GraphQL gem would push and pop to the same context stack, which would lead to the errors shown above.

Not exactly sure where that leaves us, but @fbogsany might have some thoughts, here.

plantfansam avatar Jan 03 '23 18:01 plantfansam

👋 This issue has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this issue will be closed eventually by the stale bot.

github-actions[bot] avatar Apr 27 '23 01:04 github-actions[bot]