High CPU load from find_matching_tenants_in_task_*_tmp_partition queries on Postgres 14.19
We’re seeing a very large number of Postgres queries generated by the system, falling into two shapes that start with the same pair of CTEs (shown below). Each is called ~300k times per day, and together they are responsible for a significant portion of our database CPU time, even though they usually return zero rows. On a 2 vCPU, 8 GB Postgres 14.19 server these queries come close to saturating the instance.
Postgres: 14.19 (managed Postgres)
DB instance: 2 vCPU, 8 GB RAM
Hatchet: v0.73.66
Workload characteristics
- Number of workers / tasks: only about 10 tasks running every couple of minutes, for a couple of weeks.
- Overall load: very small; there is not much work happening on the workers. This makes the observed query volume and CPU usage surprising.
What we’re seeing From pg_stat_statements (screenshot attached), the top two queries by total load are:
- Query based on find_matching_tenants_in_task_events_tmp_partition(...):
WITH
tenants AS (
SELECT
UNNEST(find_matching_tenants_in_task_events_tmp_partition(
$1::int,
$2::UUID[]
)) AS tenant_id
),
locked_events AS (
SELECT
e.tenant_id,
e.requeue_after,
e.requeue_retries,
e.id,
e.task_id,
e.task_inserted_at,
e.event_type,
e.readable_status,
e.retry_count,
e.worker_id
FROM
tenants t,
LATERAL list_task_events_tmp(
$1::int,
t.tenant_id,
$3::int
) e
),
- Query based on find_matching_tenants_in_task_status_updates_tmp_partition(...):
WITH
tenants AS (
SELECT
UNNEST(find_matching_tenants_in_task_status_updates_tmp_partition(
$1::int,
$2::UUID[]
)) AS tenant_id
),
locked_events AS (
SELECT
u.tenant_id,
u.requeue_after,
u.requeue_retries,
u.id,
u.dag_id,
u.dag_inserted_at
FROM
tenants t,
LATERAL list_task_status_updates_tmp(
$1::int,
t.tenant_id,
$3::int
) u
),
Metrics
- Calls per day (approx.): Query 1: ~303,581; Query 2: ~301,212
- Average execution time: Query 1: ~48 ms; Query 2: ~41 ms
- Average rows returned: 0 for both queries (in our workload)

These two queries dominate DB CPU usage on this instance.

Impact
Our 2 vCPU / 8 GB Postgres 14.19 server can barely handle the load. A large proportion of CPU time is spent on these queries even though they frequently return no rows. This limits how much other application traffic we can handle without scaling up the DB.
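For context, the figures above come from pg_stat_statements; a query roughly like the following (a sketch only, assuming the PG 13+ column names total_exec_time and mean_exec_time) reproduces this kind of ranking:

-- top statements by total execution time, with average rows per call
SELECT
    queryid,
    calls,
    round(total_exec_time::numeric, 2) AS total_exec_ms,
    round(mean_exec_time::numeric, 2) AS mean_exec_ms,
    round(rows::numeric / NULLIF(calls, 0), 1) AS avg_rows,
    left(query, 80) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;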
Is this query pattern and frequency expected under normal operation?
Additionally, we’ve observed that two worker processes are consistently opening the maximum allowed number of database connections, and keeping them open all the time. In practice, this means we’re holding almost 100 connections continuously without a clear functional reason.
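To see which clients are actually holding those connections, something along these lines against pg_stat_activity can be used (a sketch only; grouping by application_name and state is just one way to slice it):

-- count open client connections per application, user, and state
SELECT
    application_name,
    usename,
    state,
    count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY application_name, usename, state
ORDER BY connections DESC;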
Hey @stbt-rbtbar, note that we test against Postgres 15, 16, and 17, so I'm wondering if perhaps there are some issues on PG 14 related to querying partitioned tables. But I'm also quite surprised by the CPU usage here.
One thing you can do to reduce the usage is to set SERVER_OPERATIONS_POLL_INTERVAL to a higher value, like 10 or 20 (these are seconds). This will reduce the frequency of task status updates in the UI, so the UI may lag a little more, but it will cut down the query volume.
I am also wondering if you could provide table bloat statistics on the tables referenced in this query? Something like:
SELECT
schemaname,
relname,
n_dead_tup,
n_live_tup,
round(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY dead_pct DESC NULLS LAST
LIMIT 20;
If we're seeing high bloat on any of the tables with olap in the name, such as v1_task_events_olap_tmp_0, v1_task_events_olap, or v1_runs_olap, there are some ways to tune the autovacuum behavior which might be helpful.
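For example, autovacuum can be made more aggressive on individual tables via storage parameters; a sketch (the table name and values below are illustrative only, not tuned recommendations for your workload):

-- Example only: lower thresholds so autovacuum/autoanalyze kick in sooner on a hot OLAP table.
ALTER TABLE v1_task_events_olap_tmp_0 SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.05
);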
Thanks a lot for the detailed pointers and the query 🙏
We’ve made a couple of changes on our side:
We’ve increased SERVER_OPERATIONS_POLL_INTERVAL to 30 seconds to reduce the polling frequency and hopefully take some pressure off CPU. The extra delay in the UI is acceptable for us for now.
We’re going to look into upgrading this instance from Postgres 14 to a more recent version (likely 16), so we’ll be closer to what you test against and see whether that changes the behavior with partitioned tables.
In the meantime, we’ll also review our autovacuum settings for these OLAP tables.
Here are the current bloat stats from running your query (top entries):
schemaname  relname                         n_dead_tup  n_live_tup  dead_pct  last_autovacuum
public      v1_tasks_olap_20251124_running  4           0           100.00    2025-11-24 23:57:06.552458+00
public      v1_tasks_olap_20251103_running  20          0           100.00    2025-11-03 23:43:19.430203+00
public      v1_tasks_olap_20251130_running  3           0           100.00    2025-11-30 23:57:47.585767+00
public      v1_tasks_olap_20251122_queued   17          0           100.00    2025-11-22 23:55:08.793642+00
public      v1_tasks_olap_20251125_queued   12          0           100.00    2025-11-25 23:55:46.814985+00
public      v1_tasks_olap_20251119_queued   6           0           100.00    2025-11-19 23:57:28.639483+00
public      v1_runs_olap_20251106_queued    20          0           100.00    2025-11-06 23:53:23.348493+00
public      v1_runs_olap_20251110_queued    17          0           100.00    2025-11-10 23:54:09.204139+00
public      v1_tasks_olap_20251201_queued   9           0           100.00    2025-12-01 23:56:20.943145+00
public      v1_tasks_olap_20251102_queued   9           0           100.00    2025-11-02 23:56:09.337116+00
public      v1_runs_olap_20251102_queued    9           0           100.00    2025-11-02 23:56:09.356845+00
public      v1_tasks_olap_20251105_running  8           0           100.00    2025-11-05 23:54:18.200976+00
public      v1_tasks_olap_20251127_running  2           0           100.00    2025-11-27 23:59:07.862162+00
public      v1_tasks_olap_20251109_running  23          0           100.00    2025-11-09 23:42:27.465524+00
public      v1_runs_olap_20251115_queued    23          0           100.00    2025-11-15 23:52:21.957128+00
public      v1_runs_olap_20251122_queued    6           0           100.00    2025-11-22 23:58:08.64197+00
public      v1_runs_olap_20251201_queued    6           0           100.00    2025-12-01 23:57:21.329192+00
public      v1_tasks_olap_20251112_queued   17          0           100.00    2025-11-12 23:54:44.472116+00
public      v1_runs_olap_20251119_queued    3           0           100.00    2025-11-19 23:58:28.208149+00
public      v1_runs_olap_20251114_running   20          0           100.00    2025-11-14 23:47:34.669639+00
The absolute number of dead tuples is small, and autovacuum appears to be running regularly, but we’ll still tune the autovacuum/autoanalyze thresholds for these OLAP partitions as you suggested.
If this all looks reasonable from your side and you don’t see any other red flags, I’m happy to close the issue.