
subscription is 223x slower than equivalent query

Open lovasoa opened this issue 3 months ago • 1 comment

Version Information

Server Version: hasura/graphql-engine:v2.48.5.cli-migrations-v3
CLI Version (for CLI related issue): N/A (server-side issue)

Environment

OSS, running with Docker.

  • TimescaleDB: timescale/timescaledb:2.22.0-pg17
  • Hasura: hasura/graphql-engine:v2.48.5.cli-migrations-v3

What is the current behaviour?

When executing a GraphQL subscription on a TimescaleDB hypertable with a where clause that filters on the hypertable's partitioning column (id in this case), the generated SQL is highly inefficient. For the Hasura-generated multiplexed subscription SQL, the PostgreSQL query planner produces a Custom Scan (ChunkAppend) followed by thousands of individual index scans, one per chunk of the hypertable. This leads to extremely poor performance, especially with a large number of chunks.

This behavior is in stark contrast to a standard GraphQL query with an identical where clause, which generates an efficient SQL query that correctly utilizes TimescaleDB's chunk pruning, resulting in a single, fast index scan on the relevant chunk.

What is the expected behaviour?

The GraphQL subscription should generate efficient SQL that allows the PostgreSQL query planner (and TimescaleDB's chunk pruning mechanism) to effectively identify and scan only the relevant chunk(s) of the hypertable. The expected query plan for a subscription should be similar in efficiency to that of a regular query, avoiding the Custom Scan (ChunkAppend) over all chunks.

It appears that the way Hasura constructs the WHERE clause within its multiplexed subscription SQL (using UNNEST and LEFT OUTER JOIN LATERAL with references to __subs.result_vars #>> ARRAY[...]) prevents TimescaleDB's query planner from performing optimal chunk pruning.

How to reproduce the issue?

I set up a docker compose with a minimal reproduction here: https://github.com/lovasoa/hasura-timescale-bug

  1. Start the services using docker-compose up -d in the provided repository.
  2. Access the Hasura Console at http://localhost:8080 (admin secret: myadminsecretkey).
  3. Navigate to the "API" tab.
  4. Execute the provided GraphQL subscription with variables (e.g., min: 123456, max: 123456).
  5. Observe the significantly slow response time and analyze the generated SQL and its query plan in the Hasura console, noting the Custom Scan (ChunkAppend) and numerous index scans.
  6. For comparison, execute the provided GraphQL query with the same variables and observe its fast response and efficient query plan.
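The plan from step 5 can also be reproduced directly in psql, without Hasura, by mimicking the multiplexed shape of the subscription SQL. This is a sketch against the repro database; the UUID is a throwaway placeholder, and the JSON payload stands in for the variables Hasura would bind:

```sql
-- Standalone reproduction of the subscription's multiplexed WHERE shape.
-- The bounds are hidden inside a JSON value coming from the UNNEST row
-- source, so the planner cannot prune chunks at plan time.
EXPLAIN
SELECT "__subs"."result_id", "c"."value"
FROM UNNEST(
       (ARRAY['00000000-0000-0000-0000-000000000000'])::uuid[],
       (ARRAY['{"synthetic": ["123456", "123456"]}'])::json[]
     ) AS "__subs"("result_id", "result_vars")
LEFT OUTER JOIN LATERAL (
  SELECT "value"
  FROM "public"."conditions"
  WHERE ("public"."conditions"."id")
          >= (("__subs"."result_vars" #>> ARRAY['synthetic','0'])::integer)
    AND ("public"."conditions"."id")
          <= (("__subs"."result_vars" #>> ARRAY['synthetic','1'])::integer)
) AS "c" ON true;
```

Replacing the two `#>>` expressions with the literal `123456` in the same statement should collapse the plan back to a single index scan, matching the query case.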

Screenshots or Screencast


See the detailed query plans and generated SQL below, which clearly illustrate the performance discrepancy.

Please provide any traces or logs that could help here.

Hasura Engine Logs (relevant snippets for subscription)

{"detail":{"connection_info":{"msg":null,"token_expiry":null,"websocket_id":"8792d135-fcd2-498e-9fa4-96329369213f"},"event":{"detail":{"operation_id":"1","operation_name":"MyQuery","operation_type":{"type":"started"},"parameterized_query_hash":"97dcb06b5b1d4af29336f224f3894b8ef226bf66","query":null,"request_id":"b415cfa6-0628-4935-8f85-9e3eba88c0d3"},"type":"operation"},"user_vars":{"x-hasura-role":"admin"}},"level":"info","timestamp":"2025-09-17T20:21:25.458+0000","type":"websocket-log"}
{"detail":"Thread pollLiveQuery.PollerId {unPollerId = 8b5ebf7b-02a7-4ecd-afdb-e6430d0f6656} (re)started","level":"info","timestamp":"2025-09-17T20:21:25.458+0000","type":"unstructured"}
{"detail":{"event":{"detail":{"message":"{\"type\":\"data\",\"id\":\"1\",\"payload\":{\"data\":{\"conditions\" : [{\"value\":\"hello worl blah blah !\"}]}}}","message_size":97},"type":"message_sent"},"metadata":{"event_type":"data","operation_id":"1","operation_name":"MyQuery","parameterized_query_hash":"97dcb06b5b1d4af29336f224f3894b8ef226bf66","query_execution_time":0.378949767,"response_size":97},"websocket_id":"8792d135-fcd2-498e-9fa4-96329369213f"},"level":"debug","timestamp":"2025-09-17T20:21:26.459+0000","type":"ws-server"}

Here is a comparison between a regular query and a subscription.

Good performance with a query

GraphQL Query:

query MyQuery($min: Int, $max:Int) {
  conditions(where:{id:{_gte: $min, _lte: $max}}) {
    value
  }
}

Hasura-Generated SQL:

SELECT
  coalesce(json_agg("root"), '[]') AS "root"
FROM
  (
    SELECT
      row_to_json(
        (
          SELECT
            "_e"
          FROM
            (
              SELECT
                "_root.base"."value" AS "value"
            ) AS "_e"
        )
      ) AS "root"
    FROM
      (
        SELECT
          *
        FROM
          "public"."conditions"
        WHERE
          (
            (
              ("public"."conditions"."id") >= (('123456') :: integer)
            )
            AND (
              ("public"."conditions"."id") <= (('123456') :: integer)
            )
          )
      ) AS "_root.base"
  ) AS "_root"

Query Plan (Good - single index scan):

Aggregate  (cost=8.31..8.32 rows=1 width=32)
  ->  Index Scan using "124_124_conditions_pkey" on _hyper_1_124_chunk  (cost=0.28..8.29 rows=1 width=23)
        Index Cond: ((id >= 123456) AND (id <= 123456))
  SubPlan 1
    ->  Result  (cost=0.00..0.01 rows=1 width=32)

Poor performance with a subscription

GraphQL Subscription:

subscription MyQuery($min: Int, $max:Int) {
  conditions(where:{id:{_gte: $min, _lte: $max}}) {
    value
  }
}

Hasura-Generated SQL:

SELECT 
    "__subs"."result_id",
    "__fld_resp"."root" AS "result"
FROM UNNEST(($1)::UUID[], ($2)::JSON[]) AS "__subs"("result_id", "result_vars")
LEFT OUTER JOIN LATERAL (
    SELECT 
        json_build_object('conditions', "_conditions"."root") AS "root"
    FROM (
        SELECT 
            coalesce(json_agg("root"), '[]') AS "root"
        FROM (
            SELECT 
                row_to_json((
                    SELECT "_e"
                    FROM (
                        SELECT "_root.base"."value" AS "value"
                    ) AS "_e"
                )) AS "root"
            FROM (
                SELECT *
                FROM "public"."conditions"
                WHERE (
                    ("public"."conditions"."id") >= ( -- this json operation prevents the generation of an efficient query plan 
                        ("__subs"."result_vars" #>> ARRAY['synthetic','0'])::integer
                    )
                    AND ("public"."conditions"."id") <= ( -- this json operation prevents the generation of an efficient query plan 
                        ("__subs"."result_vars" #>> ARRAY['synthetic','1'])::integer
                    )
                )
            ) AS "_root.base"
        ) AS "_root"
    ) AS "_conditions"
) AS "__fld_resp"
ON ('true');

Query Plan (Bad - Custom Scan with thousands of index scans):

Nested Loop Left Join  (cost=42338.90..42338.94 rows=1 width=48)
  ->  Function Scan on __subs  (cost=0.01..0.01 rows=1 width=48)
  ->  Subquery Scan on _conditions  (cost=42338.89..42338.92 rows=1 width=32)
        ->  Aggregate  (cost=42338.89..42338.90 rows=1 width=32)
              ->  Custom Scan (ChunkAppend) on conditions  (cost=0.29..41963.80 rows=25006 width=23)
                    ->  Index Scan using "1_1_conditions_pkey" on _hyper_1_1_chunk  (cost=0.29..8.39 rows=5 width=23)
                          Index Cond: ((id >= ((__subs.result_vars #>> '{synthetic,0}'::text[]))::integer) AND (id <= ((__subs.result_vars #>> '{synthetic,1}'::text[]))::integer))
                    ->  Index Scan using "2_2_conditions_pkey" on _hyper_1_2_chunk  (cost=0.29..8.39 rows=5 width=23)
                          Index Cond: ((id >= ((__subs.result_vars #>> '{synthetic,0}'::text[]))::integer) AND (id <= ((__subs.result_vars #>> '{synthetic,1}'::text[]))::integer))

                        [... thousands of index scans...]

                    ->  Index Scan using "5000_5000_conditions_pkey" on _hyper_1_5000_chunk  (cost=0.29..8.39 rows=5 width=23)
                          Index Cond: ((id >= ((__subs.result_vars #>> '{synthetic,0}'::text[]))::integer) AND (id <= ((__subs.result_vars #>> '{synthetic,1}'::text[]))::integer))
                    ->  Bitmap Heap Scan on _hyper_1_5001_chunk  (cost=4.23..13.80 rows=6 width=32)
                          Recheck Cond: ((id >= ((__subs.result_vars #>> '{synthetic,0}'::text[]))::integer) AND (id <= ((__subs.result_vars #>> '{synthetic,1}'::text[]))::integer))
                          ->  Bitmap Index Scan on "5001_5001_conditions_pkey"  (cost=0.00..4.23 rows=6 width=0)
                                Index Cond: ((id >= ((__subs.result_vars #>> '{synthetic,0}'::text[]))::integer) AND (id <= ((__subs.result_vars #>> '{synthetic,1}'::text[]))::integer))
              SubPlan 1
                ->  Result  (cost=0.00..0.01 rows=1 width=32)

lovasoa avatar Sep 17 '25 20:09 lovasoa

edit: after more testing, the fix suggested below actually does not work. As soon as data from the lateral join is used in the WHERE clause, the optimizer is unable to access the right indices.

Suggested Fix

The core issue lies in how the WHERE clause of the subscription query references its filtering parameters. Extracting values from a JSON object (__subs.result_vars #>> ARRAY['synthetic','0']) prevents PostgreSQL's query planner from performing optimizations such as TimescaleDB's chunk pruning, because the planner sees expressions in the WHERE conditions that it apparently cannot reason about at plan time.

To enable proper query planning and chunk pruning, the parameters for the WHERE clause should be referenced directly as arrays or scalar values, rather than being extracted from a JSON object.

For example, if the parameters were directly available as an array (e.g., $2 being an array of arrays of values), the WHERE clause could be structured more efficiently as:

FROM
  UNNEST(
    ARRAY[gen_random_uuid()]::uuid[]
  ) WITH ORDINALITY AS "__subs"("result_id", "sub_idx")

...

WHERE "public"."conditions"."id" >= ($2[__subs.sub_idx][1])::integer
  AND "public"."conditions"."id" <= ($2[__subs.sub_idx][2])::integer

In my tests, this direct referencing allows the query planner to understand the constraints immediately, leading to efficient index scans and proper chunk pruning, similar to the performance observed in regular GraphQL queries. The goal is for Hasura to generate SQL where the WHERE clause conditions directly use the prepared statement parameters in a form that the PostgreSQL planner can optimize.
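For completeness, a fully assembled sketch of this proposed shape might look like the following. This is illustrative only, not Hasura's actual output: the table and column names come from the repro, the PREPARE wrapper is added so it can be tested in psql, and, per the edit above, the planner may still fail to prune once the subscript references the ordinality column from the lateral row source, so treat it as a starting point rather than a confirmed fix:

```sql
-- Sketch of the proposed rewrite: subscriber variables travel in a 2-D
-- integer array parameter ($2) instead of a JSON array, indexed by each
-- subscriber's ordinal position in the multiplexed batch.
PREPARE multiplexed(uuid[], integer[][]) AS
SELECT "__subs"."result_id",
       (
         SELECT coalesce(json_agg(json_build_object('value', t."value")), '[]')
         FROM (
           SELECT "value"
           FROM "public"."conditions"
           WHERE "public"."conditions"."id" >= $2[__subs.sub_idx][1]
             AND "public"."conditions"."id" <= $2[__subs.sub_idx][2]
         ) AS t
       ) AS "result"
FROM UNNEST($1) WITH ORDINALITY AS "__subs"("result_id", "sub_idx");
```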

lovasoa avatar Sep 17 '25 21:09 lovasoa