graphql-metrics Metrics for multiplexes are a bit broken/inaccurate.

Edit: Problems #1 and #2 appear to be non-problems. See below.

As demonstrated in this PR, and in the below multiplex scenario walkthrough, timings for parsing, lexing and query duration are wrong when executing multiplexes.

There are last-write wins problems in lexing/parsing timings in the Tracer, and query durations are inaccurate because we wait until Instrumentation.after_query to calculate them.

The queries being executed in the throwaway PR are:

test 'multiplex test' do
    queries = [
    {
      query: 'query QueryOne { post(id: "42") { id title } }',
      operation_name: 'QueryOne',
    },
    {
      query: 'query QueryTwo { post(id: "42") { body } }',
      operation_name: 'QueryTwo',
    },
  ]

  SchemaWithFullMetrics.multiplex(queries)
# ...

Tracer.trace(execute_multiplex)
- Capturing pre_context.multiplex_start_time_* in pre_context
- NOTE: So far so good. This is captured once at the outset, and is stored in each query's context once it exists.
Tracer.trace(lex)
- Capturing pre_context.lexing_* in pre_context
- Problem #1: This value will be overwritten by any later documents in the multiplex which are lexed.
Tracer.trace(parse)
- Capturing pre_context.parsing_* in pre_context
- Problem #2: This value will be overwritten by any later documents in the multiplex which are parsed.
Tracer.trace(validate)
- Capturing last pre_context.parsing_, lexing_ times in this query's context
- Capturing validation time in this query's context
- NOTE: This first query will correctly receive the lex and parse times in its context.
- NOTE: This query will correctly store its validation time.
Tracer.trace(lex)
- Capturing pre_context.lexing_* in pre_context
- Problem #1
Tracer.trace(parse)
- Capturing pre_context.parsing_* in pre_context
- Problem #2
Tracer.trace(validate)
- Capturing last pre_context.parsing_, lexing_ times in this query's context
- Capturing validation time in this query's context
- Problems #1 & #2: This nth query will store the first query's lex and parse times.
- NOTE: This query will correctly store its validation time.
Tracer.trace(analyze_multiplex)
- Here we clear the pre_context.
Tracer.trace(analyze_query)
- (Current operation: QueryOne)
- Analyzer.on_leave_field - id
- Analyzer.on_leave_field - title
- Analyzer.on_leave_field - post
- Capturing analysis time in this query's context
Tracer.trace(analyze_query)
- (Current operation: QueryTwo)
- Analyzer.on_leave_field - body
- Analyzer.on_leave_field - post
- Capturing analysis time in this query's context
Tracer.trace(execute_query)
- Capturing execute time in this query's context
Tracer.trace(authorized)
Tracer.trace(execute_field)
- Capturing field execution time in this query's context
Tracer.trace(authorized)
Tracer.trace(execute_field)
- in Post.id resolver
- Capturing field execution time in this query's context
Tracer.trace(execute_field)
- in Post.title resolver
- Capturing field execution time in this query's context
Tracer.trace(execute_query)
- Capturing execute time in this query's context
Tracer.trace(authorized)
Tracer.trace(execute_field)
- Capturing field execution time in this query's context
Tracer.trace(authorized)
Tracer.trace(execute_field)
- in Post.body resolver
- Capturing field execution time in this query's context
Tracer.trace(execute_query_lazy)
- NOTE: Unhandled.
Instrumentation.after_query - query_duration extracted for current query
- Making use of the start time set when the execute_query tracer event fired.
- Problem #3: Waiting until Instrumentation.after_query produces inaccurate query execution times. We should calculate query_duration in the Tracer instead, after the yield in capture_query_start_time (and rename that method for accuracy).
Instrumentation.after_query - query_duration extracted for current query
- Making use of the start time set when the execute_query tracer event fired.
- Problem #3.

Mar 16 '21 14:03 chrisbutcher

cc: @eapache @RobertWSaunders

Mar 16 '21 14:03 chrisbutcher

This explanation makes sense to me, though it raises one other question: I consistently see execute_multiplex running after lex and parse, and you seem to consistently see it running before. What's going on here?

Mar 16 '21 15:03 eapache

@eapache In those instances, are you passing .multiplex an already-parsed GraphQL::Query or a document string? This is my best guess as to why the order might differ.

Mar 16 '21 17:03 chrisbutcher

This is in core, which instantiates a GraphQL::Query instance with the string, and then calls .result on it. AFAICT this intentionally lexes/parses before it enters the multiplexer: https://github.com/rmosolgo/graphql-ruby/blob/5c3a8ec6f45319f40cb59f2ffb5961bb1071aa80/lib/graphql/query.rb#L198-L200

This may be a genuine inconsistency within the graphql-ruby gem depending on the entrypoint you use?

Yes: https://github.com/rmosolgo/graphql-ruby/issues/3393

Mar 16 '21 17:03 eapache

On second thought: Problems 1 and 2 might not actually be problems?

While it's true that the pre_context is not aware of mulitplexes (and multiple queries, contexts), we do run parse, lex and validate in that order, each time we run validate, we're using the values for parsing and lexing time that we last stored.

So long as graphql-ruby never runs something like parse (query 1), lex (query 1), parse (query 2), lex (query 2), validate (query 1), validate (query 2), and always does parse 1, lex 1, validate 1, parse n, lex n, validate n; we're ok?

Maybe the real problem is that if you pass a multiplex an unparsed document string and an already parsed query object (as one of our existing tests does), you don't always get a validate event that is preceded by a parse/lex events for the same query.

If that's true, maybe we just need to clear the pre_context after validate makes use of any values within?

tldr after sync:

As long as parse, lex and validate run in that order (without interleaving across queries in a multiplex) Problems #1 and #2 as described above are not real.
Problem #3 should be addressed by 1) still providing query execution time that includes the one-and-only execute_query_lazy call, shared between queries in a multiplex, and 2) emit a separate metric per query that exclude lazy execution time, by recording an "end" time after the yield in capture_query_start_time (again, to be renamed appropriately).
We should make all of this more clear with method/variable renames, comments etc. in Tracer in particular, to ensure it's clear what/when/why is being handed off between pre_context and context.
We should clear the pre_context immediately after obtaining values we need from it during validate, so we don't have a problem where a mixture of pre-parsed and parsed queries get incorrect values in validate, having assumed that lex/parse ran prior for the same query.

Mar 16 '21 20:03 chrisbutcher