[Bug]: Compression Policy Stalling After Upgrade to TimescaleDB 2.16.2
What type of bug is this?
Crash
What subsystems and features are affected?
Compression
What happened?
TL;DR: My database's compression policy job (`CALL _timescaledb_functions.policy_compression()`) has not run successfully since I upgraded from version 2.15.3 to 2.16.2, approximately four months ago. It appears to get stuck on specific chunks during compression.
The issue was resolved by running a SQL query that manually decompresses and recompresses all hypertable chunks - see the SQL at the bottom for more info.
Recently, I noticed that the periodic job CALL _timescaledb_functions.policy_compression() was running indefinitely.
So I looked at this job's stats via ...
```sql
SELECT * FROM timescaledb_information.job_stats;
```
| job_id | application_name | schedule_interval | max_runtime | max_retries | retry_period | proc_schema | proc_name | owner | scheduled | fixed_schedule | config | next_start | initial_start | hypertable_schema | hypertable_name | check_schema | check_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1014 | Compression Policy [1014] | 3 days 12:00:00 | 00:00:00 | -1 | 01:00:00 | _timescaledb_functions | policy_compression | postgres | true | false | {"hypertable_id": 61, "compress_after": "7 days"} | -infinity | | public | sensor_reading_generic | _timescaledb_functions | policy_compression_check |
... and saw that the last time it ran successfully was 4 months ago.
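For anyone else debugging this: if the job errors out (rather than hangs), the failure details can also be pulled from the `timescaledb_information.job_errors` view available in recent TimescaleDB versions. A minimal sketch (the job id 1014 is taken from the table above):

```sql
-- A sketch: inspect recorded errors for the compression policy job.
SELECT job_id, proc_name, start_time, finish_time, sqlerrcode, err_message
FROM timescaledb_information.job_errors
WHERE job_id = 1014
ORDER BY start_time DESC
LIMIT 10;
```

In my case this wasn't enough on its own, since a hung job records no error row until it is cancelled.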
This timing corresponds with an upgrade of Timescale from version 2.15.3 to 2.16.2.
I noticed that the compression policy query CALL _timescaledb_functions.policy_compression() was getting stuck on one particular hypertable chunk by running ...
```sql
WITH query AS (
    SELECT pid
    FROM pg_stat_activity
    WHERE query = 'CALL _timescaledb_functions.policy_compression()'
)
SELECT
    l.relation::regclass AS locked_relation,
    (c.chunk_schema || '.' || c.chunk_name)::regclass,
    c.range_start,
    l.mode,
    l.locktype,
    a.query,
    a.state,
    a.wait_event_type,
    a.wait_event
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
JOIN query q ON q.pid = l.pid
LEFT JOIN timescaledb_information.chunks c
    ON (c.chunk_schema || '.' || c.chunk_name)::regclass = l.relation::regclass;
```
... while it was executing.
Note - the compression policy holds a lock on each hypertable it is compressing. As long as it is stuck on a specific hypertable, backfilling to that hypertable (i.e. importing historic readings) doesn't work either, since INSERTs are blocked by this lock.
This hasn't been a major issue since backfilled historic data is mostly non-critical.
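As an aside, the sessions blocked behind the stuck policy can be listed with PostgreSQL's `pg_blocking_pids()`. A minimal sketch (filtering on `wait_event_type = 'Lock'` is an assumption about how the blocked INSERTs show up):

```sql
-- A sketch: list sessions waiting on locks and the PIDs blocking them.
SELECT
    blocked.pid                   AS blocked_pid,
    blocked.query                 AS blocked_query,
    pg_blocking_pids(blocked.pid) AS blocking_pids
FROM pg_stat_activity blocked
WHERE blocked.wait_event_type = 'Lock';
```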
I found a relevant similar issue (thanks to ChatGPT 4o) that suggested manually compressing the chunk ...
Timescale Community Forum: In a forum post, a user described that all their compression policy jobs had been failing for approximately six weeks. The failures started after upgrading the Timescale extension. Manually running the jobs succeeded, but scheduled executions failed. The user sought advice on obtaining more detailed error information to diagnose the problem. 
So I tried this via ...
```sql
SELECT compress_chunk('_timescaledb_internal.<chunk_name>');
```
... and found that the query ran indefinitely.
So (dumbly) I tried decompressing the chunk manually via ...
```sql
SELECT decompress_chunk('_timescaledb_internal.<chunk_name>');
```
... and it eventually decompressed after 40 minutes! Then I recompressed it, which this time took less than a minute!
I don't quite understand what went wrong or how this fixed it. ChatGPT 4o suggested that the decompress/compress cycle resolved metadata inconsistencies and bloat, but my understanding isn't good enough to dig into that.
So I re-ran the compression policy; this time it got stuck on another chunk.
So now I'm running decompress/compress on all chunks via ...
```sql
DO $$
DECLARE
    chunk RECORD;
BEGIN
    FOR chunk IN
        SELECT chunk_schema, chunk_name
        FROM timescaledb_information.chunks
        WHERE is_compressed = true
    LOOP
        -- Step 1: Decompress the chunk
        RAISE NOTICE 'Decompressing chunk: %.%', chunk.chunk_schema, chunk.chunk_name;
        EXECUTE format('SELECT decompress_chunk(''%I.%I'')', chunk.chunk_schema, chunk.chunk_name);
        -- Step 2: Compress the chunk
        RAISE NOTICE 'Compressing chunk: %.%', chunk.chunk_schema, chunk.chunk_name;
        EXECUTE format('SELECT compress_chunk(''%I.%I'')', chunk.chunk_schema, chunk.chunk_name);
    END LOOP;
END $$;
```
Fingers crossed this resolves it
TimescaleDB version affected
2.16.2
PostgreSQL version used
PostgreSQL 14.13, compiled by Visual C++ build 1940, 64-bit
What operating system did you use?
Windows Server 2019 Datacenter 17763.6532
What installation method did you use?
Other
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
No response
How can we reproduce the bug?
This might be hard to reproduce other than by upgrading from version 2.15.3 to 2.16.2 on a Windows server (if I've understood the issue).
Also note that (if I'm not mistaken) I should be able to set a timeout on the compression policy like ...

```sql
SELECT add_compression_policy('<hypertable-name>', INTERVAL '1 week');

SELECT alter_job(
    job_id,
    max_runtime  => INTERVAL '1 hour',
    max_retries  => 3,
    retry_period => INTERVAL '10 minutes'
)
FROM timescaledb_information.jobs
WHERE hypertable_name = '<hypertable-name>'
  AND proc_name = 'policy_compression';
```
I can confirm that after running the decompress/compress SQL query above, the compression jobs are now running again without any issues.
@rdmolony Thank you for reporting this issue. Do you think this issue is specific to Windows?
Thanks @erimatnor, it could well be, I can't be sure
I was also stuck with `CALL _timescaledb_functions.policy_compression()`. I installed TimescaleDB some days ago and had no issues until today.
TimescaleDB version affected: timescaledb-2-postgresql-14 (2.17.2~ubuntu22.04)
PostgreSQL version used: (PostgreSQL) 14.12 (Ubuntu 14.12-0ubuntu0.22.04.1)
How can we reproduce the bug? I followed https://docs.timescale.com/self-hosted/latest/install/installation-linux/ and inserted data for some days. The specific chunk that freezes policy_compression is from the biggest table and contains 57 million rows. I also had some out-of-space episodes on disk in past days, which were resolved before policy_compression started.
I used @rdmolony's script but it got stuck on `SELECT decompress_chunk('_timescaledb_internal._hyper_25_27_chunk')` with a "Lock: relation" wait. Following https://docs.timescale.com/use-timescale/latest/compression/compression-policy/, I paused the compression policy first by finding its job_id and disabling it; the script then finished, and I re-enabled the policy.
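The pause/resume step described above can be sketched as follows (the job id 1015 is a placeholder for whatever the first query returns):

```sql
-- A sketch of pausing a compression policy before manual chunk work,
-- then re-enabling it afterwards. The job id (1015) is a placeholder.
SELECT job_id
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression';

SELECT alter_job(1015, scheduled => false);  -- pause the policy
-- ... run the manual decompress/compress script here ...
SELECT alter_job(1015, scheduled => true);   -- resume the policy
```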
An update: I hit the same problem again, but this time I decided to wait a few days to see if it resolved on its own, and it did. My app was stopped for a couple of hours, so it was probably new inserts blocking the compression job.
Unfortunately, there are instances where the compression policy can take a significant amount of time to re-compress a chunk, which is what I think people are hitting here. This is one of the things we are actively working on improving. One of the improvements lands in January: https://github.com/timescale/timescaledb/pull/7482
Meanwhile, decompressing and compressing these problematic chunks manually is your best bet.
There are plans to work on lock contention here as well, which can be caused by the compression policy running concurrently with inserts into the chunk being compressed. In the meantime, I suggest adjusting the policy so it only runs on chunks that no longer receive inserts. For instance, if you are inserting data over the last 3 days, the compression policy should compress chunks that are at least older than that. I realize this cannot be done for all workloads, which is why we plan on addressing it in the near future.
Hope this makes sense.
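Concretely, widening the policy's window so it skips chunks that still receive inserts might look like the following sketch (the hypertable name and interval are placeholders; adjust `compress_after` to exceed your backfill window):

```sql
-- A sketch: recreate the policy with a wider compress_after so that
-- chunks still receiving inserts are skipped. Names are placeholders.
SELECT remove_compression_policy('sensor_reading_generic');
SELECT add_compression_policy('sensor_reading_generic', INTERVAL '7 days');
```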
Hi, we are experiencing similar issues with compression and compression policies across all four of our Timescale databases. The current issue occurs on TimescaleDB v2.17, but we have also observed similar behavior on v2.16 and v2.14. We have now set up a fifth Timescale 2.18 server solely to debug these compression issues.
Issue Overview
- When we initially create the compression policy, it works as expected. However, after some time, it fails or runs indefinitely (we have seen runs of 4 days). When this happens, the only option is to decompress and recompress the entire table.
- When it fails silently, no errors are logged, and chunks become corrupted.
- We suspect that deadlocks from backfill upsert during compression, or parallel backfill upserts into compressed chunks might be causing policy failures and corrupted chunks, but this is just a guess.
- Compression policy failures aren’t always logged, making it difficult to diagnose.
Error Logs from Corrupted Chunks
Here are example warnings from a corrupted chunk that fails the compression policy, whether the policy is run manually via run_job(job_id) or scheduled via the automation framework:
- WARNING: no index on "insert_time" found for column range on chunk "_hyper_8_60915_chunk", DETAIL: column range works best with an index on the dimension.
- WARNING: compressing chunk "_timescaledb_internal._hyper_8_60915_chunk" failed when compression policy is executed, DETAIL: Message: (type with OID 0 does not exist), Detail: ().
- ERROR: compression policy failure, DETAIL: Failed to compress '14' chunks. Successfully compressed '1' chunks.
```
2025-02-12 14:20:42.759 EET [3701037] STATEMENT: CALL run_job(1000)
2025-02-12 14:20:42.759 EET [3701037] WARNING: no index on "insert_time" found for column range on chunk "_hyper_8_60915_chunk"
2025-02-12 14:20:42.759 EET [3701037] DETAIL: column range works best with an index on the dimension.
2025-02-12 14:20:42.759 EET [3701037] CONTEXT: SQL statement "SELECT public.compress_chunk( chunk_rec.oid )"
PL/pgSQL function _timescaledb_functions.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean,boolean) line 66 at PERFORM
SQL statement "CALL _timescaledb_functions.policy_compression_execute(
  job_id, htid, lag_value::INTERVAL,
  maxchunks, verbose_log, recompress_enabled, use_creation_time
)"
PL/pgSQL function _timescaledb_functions.policy_compression(integer,jsonb) line 60 at CALL
2025-02-12 14:20:42.772 EET [3701037] WARNING: compressing chunk "_timescaledb_internal._hyper_8_60915_chunk" failed when compression policy is executed
2025-02-12 14:20:42.772 EET [3701037] DETAIL: Message: (type with OID 0 does not exist), Detail: ().
2025-02-12 14:20:42.772 EET [3701037] CONTEXT: PL/pgSQL function _timescaledb_functions.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean,boolean) line 72 at RAISE
SQL statement "CALL _timescaledb_functions.policy_compression_execute(
  job_id, htid, lag_value::INTERVAL,
  maxchunks, verbose_log, recompress_enabled, use_creation_time
)"
PL/pgSQL function _timescaledb_functions.policy_compression(integer,jsonb) line 60 at CALL
2025-02-12 14:20:42.775 EET [3701037] ERROR: compression policy failure
2025-02-12 14:20:42.775 EET [3701037] DETAIL: Failed to compress '14' chunks. Successfully compressed '1' chunks.
2025-02-12 14:20:42.775 EET [3701037] CONTEXT: PL/pgSQL function _timescaledb_functions.policy_compression_execute(integer,integer,anyelement,integer,boolean,boolean,boolean) line 120 at RAISE
SQL statement "CALL _timescaledb_functions.policy_compression_execute(
  job_id, htid, lag_value::INTERVAL,
  maxchunks, verbose_log, recompress_enabled, use_creation_time
)"
PL/pgSQL function _timescaledb_functions.policy_compression(integer,jsonb) line 60 at CALL
```
Despite these warnings, the hypertable does have the required indexes (details below).
Compression and Indexing Configuration
We have a total of 1306 7-day chunks, of which 16 are uncompressed. The uncompressed chunks don't have much data and take less than 1 MB of space; compressed chunks take around 150 MB.
We are using the following compression settings:
| Hypertable | Chunk | Segment By | Order By |
|---|---|---|---|
| time_series | _timescaledb_internal._hyper_8_30_chunk | time_series_id | insert_time DESC, "time" DESC |
These are our indexes on time_series, yet the warnings indicate missing indexes on insert_time in some chunks:
```sql
CREATE INDEX time_series_insert_time_idx ON public.time_series USING btree (insert_time DESC);
CREATE UNIQUE INDEX time_series_pkey ON public.time_series USING btree (time_series_id, "time");
CREATE INDEX time_series_time_idx ON public.time_series USING btree ("time" DESC);
```
Suspected Causes
1. Deadlocks or Race Conditions During Compression Policy Execution
   - We suspect deadlocks might be causing some compression policy failures and chunk corruption, but these aren't being logged properly.
   - The compression policy runs indefinitely/freezes without reporting errors.
   - If the policy modifies existing chunks while other operations run, this could cause race conditions.
Request for Guidance
1. How can we get better visibility into why compression fails?
   - Are there any specific logging options or internal TimescaleDB logs we can safely enable without impacting production performance?
   - Can we improve monitoring of compression policy execution? We are currently setting up Grafana, Prometheus Postgres-Exporter, and Zabbix to gather statistics.
2. What is the best way to handle chunk corruption?
   - Right now, our only option is to decompress and recompress the entire table, which is not ideal.
   - Is there a safe way to identify and repair or reindex only affected chunks?
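One way to narrow a failure down to specific chunks, rather than recompressing the whole table, is to attempt compression chunk-by-chunk and record which ones error out. This is a sketch, not an official repair procedure; the hypertable name is a placeholder:

```sql
-- A sketch: try to compress each uncompressed chunk individually and
-- report the ones that fail, instead of redoing the entire table.
DO $$
DECLARE
    chunk RECORD;
BEGIN
    FOR chunk IN
        SELECT chunk_schema, chunk_name
        FROM timescaledb_information.chunks
        WHERE hypertable_name = 'time_series'
          AND NOT is_compressed
    LOOP
        BEGIN
            EXECUTE format('SELECT compress_chunk(''%I.%I'')',
                           chunk.chunk_schema, chunk.chunk_name);
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE 'Chunk %.% failed: %',
                chunk.chunk_schema, chunk.chunk_name, SQLERRM;
        END;
    END LOOP;
END $$;
```

The NOTICE output then gives a list of the specific chunks (and errors) to investigate.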
System Information
- TimescaleDB Version: 2.17 (previously tested on 2.16 and 2.14)
- PostgreSQL Version: 15.3 and 16
- OS: Red Hat 9.4
- Installation Method: YUM Package Manager
We would greatly appreciate any insights or recommended debugging steps. Thanks for your time! 🚀
Do you have any results testing version 2.18? It should improve the situation.
How do you know these chunks are corrupt? What leads you to believe that? Are you able to query the data in them after the compression policy fails?
It's hard to say how to fix this corruption without some more details.
As for monitoring compression policy, you can add verbose logging to the policy
```sql
SELECT alter_job(1000, config => '{"hypertable_id": 1, "compress_after": "7 days", "verbose_log": true}');
```
Reference: https://docs.timescale.com/use-timescale/latest/user-defined-actions/alter-and-delete/#change-a-jobs-config
Hi @antekresic, I unfortunately cannot assist in this issue any longer as I no longer have access to the problematic database, and moreover once the issue was resolved the database was no longer in a problematic state.
(I'd love to have been able to create a reproducible issue, however, this issue depends upon database state which I am/was unable to share, good luck!)
Hi @antekresic
Do you have any results testing version 2.18? It should improve the situation.
I didn't manage to do much testing with version 2.18, but with 2.19, I've already noticed improvements. For example:
- I'm now able to multi-thread backfills on compressed chunks using timescaledb-parallel-copy without deadlocks
- I can also run multi-thread backfills on compressed chunks while compression is running, which previously caused issues
- These tests used data sorted by time in ascending order, starting from the oldest timestamps.
That said, my testing with 2.19 is still limited. I'm planning to test scenarios with completely unordered (mixed) input data, as well as concurrent parallel-copy runs from multiple sources, for example, 5 separate servers, each running parallel-copy with 6 workers, totaling 30 workers writing to the same Timescale table. All of these use cases have previously caused various issues, mainly deadlocks.
How do you know these chunks are corrupt? What leads you to believe that? Are you able to query the data in them after the compression policy fails?
I believe a chunk is corrupt when the automatic compression policy hangs on it and won’t proceed. The only workaround so far has been to manually decompress and recompress the chunk, but sometimes even that doesn’t help, as in this case with version 2.17.0.
As mentioned in my earlier report, we're seeing errors like:
- type with OID 0 does not exist
- no index on 'insert_time' found for column range on chunk
In this specific case, the chunk is already uncompressed, but attempting to compress it fails with the OID error, even manual compression doesn’t work. So I believe the OID error might be what's causing the compression policy to freeze.
I'm able to query the data normally, but manual compression fails due to the OID error. This makes me think something is broken in the chunk's metadata or catalog references.
As for monitoring compression policy, you can add verbose logging to the policy
```sql
SELECT alter_job(1000, config => '{"hypertable_id": 1, "compress_after": "7 days", "verbose_log": true}');
```
Thank you for this. I’ll try it out next time we encounter issues with the policy. So far, in our another test DB, it’s been working well with 2.19.
Just to mention: I’m doing this testing as time allows, in parallel with other unrelated development work, so I may not be able to respond immediately or test everything right away. But I’ll continue reporting anything notable as we go.
Thanks again for the support!
Assigning to @antekresic so that he can track testing results.
Dear Author,
This issue has been automatically marked as stale due to lack of activity. With only the issue description that is currently provided, we do not have enough information to take action. If you have or find the answers we would need, please reach out. Otherwise, this issue will be closed in 30 days.
Thank you!
Dear Author,
We are closing this issue due to lack of activity. Feel free to add a comment to this issue if you can provide more information and we will re-open it.
Thank you!