Optional instrumentation for recording GraphQL response field lengths in OTel

Open tninesling opened this issue 1 year ago • 2 comments

Overview

Adds a new instrumentation config, graphql, which supports a single metric called field.length. When enabled, this will publish the lengths of array fields returned in primary supergraph responses. This is primarily meant to help debug unexpected cost values calculated by the demand control plugin, as these discrepancies are multiplied by the length of lists in the responses.
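As a rough illustration of what the metric measures, the sketch below walks a JSON-like response and collects the length of every list it finds, keyed by field path. It is not the router's implementation (which zips the response with the query); the function name `list_field_lengths` and the dotted path format are assumptions made for this example.

```python
from typing import Any, Iterator


def list_field_lengths(data: Any, path: str = "") -> Iterator[tuple[str, int]]:
    """Yield (field path, length) for every list in a JSON-like response.

    Hypothetical helper for illustration only; not part of the router.
    """
    if isinstance(data, dict):
        for name, value in data.items():
            child = f"{path}.{name}" if path else name
            yield from list_field_lengths(value, child)
    elif isinstance(data, list):
        yield path, len(data)
        for item in data:
            # Items share the parent's field path; indices are dropped so that
            # lengths are grouped per field rather than per element.
            yield from list_field_lengths(item, path)


response = {
    "data": {
        "products": [
            {"reviews": [{"id": 1}, {"id": 2}]},
            {"reviews": []},
        ]
    }
}
lengths = list(list_field_lengths(response["data"]))
```

Each pair in `lengths` corresponds to one value that would be recorded for the field at that path. Because nested lists are visited once per parent element, a single long outer list multiplies the number of inner measurements, which mirrors how cost discrepancies are multiplied by list length.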

Primary responses only

Note that this implementation does not work for deferred responses. The primary blocker is that we don't currently have a way to zip a response with a query when that response doesn't start at the query root. To make this work, we would need to take the deferred response's JSON path and determine which subsection of the schema to use for the zip procedure.
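One possible shape for that missing step is sketched below: given a deferred response's path, follow it into the query's selection to find the subsection that the deferred payload should be zipped against. The nested-dict representation of a query and the `selection_at_path` helper are hypothetical, and the sketch ignores fragments, aliases, and type conditions that a real implementation would have to handle.

```python
from typing import Optional, Union

# Hypothetical query representation: field name -> nested selection,
# with an empty dict for leaf fields.
query = {"products": {"name": {}, "reviews": {"body": {}}}}


def selection_at_path(root: dict, path: list[Union[str, int]]) -> Optional[dict]:
    """Follow a response path into a query selection, skipping list indices."""
    current = root
    for segment in path:
        if isinstance(segment, int):
            # A list index picks an element, not a different field,
            # so the selection stays the same.
            continue
        if segment not in current:
            return None  # the path does not exist in this query
        current = current[segment]
    return current


# A deferred payload rooted at the first product's reviews.
sub_selection = selection_at_path(query, ["products", 0, "reviews"])
```

With the sub-selection in hand, the deferred payload could in principle be zipped the same way a primary response is zipped with the full query.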

No support for custom attributes

The other instrumentation configurations support custom metrics built from predefined attributes. For example, you can create a custom router metric based on the HTTP response status code. This functionality comes from the custom histogram/attribute/selector framework we've implemented, but this GraphQL field-related code does not seem to fit cleanly into those existing abstractions. In the interest of time, I've settled on creating this one-off metric, which is not extensible and cannot be used in custom metrics.

No support for conditions

One change not included in this PR, which we will need to add, is support for filtering via conditions. When enabled, this metric is published for every list field across all responses, which has the potential to produce far more information than is useful or wanted. The existing conditions implementation is likely not compatible with this metric as-is, because we need to check a given condition for each field in the response when deciding whether to publish the metric. The current conditions setup caches any evaluated condition: if the condition is true once, it is rewritten to a static true condition that is never re-evaluated. We will need to create an uncached equivalent that can be evaluated several times within a single request pipeline for use with this field length metric. That will be coming in the next PR.
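The caching problem can be shown with a small model. The two classes below are hypothetical stand-ins, not the router's condition types: `CachedCondition` imitates the "once true, always true" rewrite described above, while `UncachedCondition` re-evaluates its predicate on every call.

```python
from typing import Callable


class CachedCondition:
    """Imitates the described caching: after the first true result,
    the condition behaves as a static true."""

    def __init__(self, predicate: Callable[[int], bool]) -> None:
        self._predicate = predicate
        self._resolved_true = False

    def evaluate(self, value: int) -> bool:
        if self._resolved_true:
            return True  # predicate is no longer consulted
        if self._predicate(value):
            self._resolved_true = True
            return True
        return False


class UncachedCondition:
    """Re-evaluates the predicate for every value."""

    def __init__(self, predicate: Callable[[int], bool]) -> None:
        self._predicate = predicate

    def evaluate(self, value: int) -> bool:
        return self._predicate(value)


def is_long(length: int) -> bool:
    # Example filter: only publish lists longer than 10 items.
    return length > 10


field_lengths = [3, 50, 2]  # three list fields within one response
cached = CachedCondition(is_long)
uncached = UncachedCondition(is_long)
cached_results = [cached.evaluate(n) for n in field_lengths]
uncached_results = [uncached.evaluate(n) for n in field_lengths]
```

In this model the cached condition lets the third field through even though its length is 2, because the second field already flipped it to a static true. The uncached variant filters each field independently, which is the behavior a per-field metric needs.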


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • [X] Changes are compatible[^1]
  • [ ] Documentation[^2] completed
  • [ ] Performance impact assessed and acceptable
  • Tests added and passing[^3]
    • [X] Unit Tests
    • [ ] Integration Tests
    • [ ] Manual Tests

Exceptions

Note any exceptions here

Notes

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

[^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples.

[^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

tninesling • May 17 '24 21:05

@tninesling, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

github-actions[bot] • May 17 '24 21:05

CI performance tests

  • [x] step - Basic stress test that steps up the number of users over time
  • [ ] events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • [ ] large-request - Stress test with a 1 MB request payload
  • [ ] events - Stress test for events with a lot of users and deduplication ENABLED
  • [ ] xxlarge-request - Stress test with 100 MB request payload
  • [ ] events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • [ ] xlarge-request - Stress test with 10 MB request payload
  • [ ] step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • [ ] events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • [ ] no-graphos - Basic stress test, no GraphOS.
  • [ ] reload - Reload test over a long period of time at a constant rate of users
  • [ ] events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • [ ] events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • [x] const - Basic stress test that runs with a constant number of users

router-perf[bot] • May 17 '24 21:05

This was redone in #5215

tninesling • May 28 '24 15:05