
[Bug]: Timescale segfaults when backfilling data

Open iliastsa opened this issue 2 years ago • 8 comments

What type of bug is this?

Crash

What subsystems and features are affected?

Data ingestion

What happened?

When backfilling data into a hypertable, we get a segfault and the server goes into recovery mode. We've encountered this multiple times, on all TimescaleDB versions since we started using it (which was at v2.11.1 if I'm not mistaken).

It should be noted that we've noticed that dropping old chunks before backfilling helped the backfill to progress further sometimes.
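For reference, the chunk dropping mentioned above can be done with TimescaleDB's drop_chunks; a rough sketch, assuming an integer-based time column (so older_than takes a bigint cutoff; the value here is illustrative):

```
-- Illustrative sketch only: drop chunks older than some cutoff before backfilling.
-- The table name matches the schema posted below; the cutoff value is made up.
SELECT drop_chunks('logs', older_than => 1000000000::bigint);
```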

TimescaleDB version affected

2.13.1

PostgreSQL version used

15.5

What operating system did you use?

Ubuntu 22.04 LTS x64

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

dmesg logs:

postgres[3937358]: segfault at 140 ip 00007f82f81d6661 sp 00007ffe2cdeb020 error 6 in timescaledb-tsl-2.13.1.so[7f82f819e000+80000]


postgres logs:

2024-01-18 11:04:10.174 UTC [3909348] LOG:  server process (PID 3937358) was terminated by signal 11: Segmentation fault
2024-01-18 11:04:10.174 UTC [3909348] DETAIL:  Failed process was running: copy "logs" (<list of columns>) from stdin

How can we reproduce the bug?

I haven't been able to reproduce this locally with a small dataset. I'll try to get a proof of concept going, but I suspect it has to do with some corruption issue that might be hard or impossible to reproduce.

iliastsa avatar Jan 18 '24 12:01 iliastsa

hi @iliastsa, thank you for reaching out. Is it possible to share the schema and hypertable definition for the hypertable that you are getting the segfault with? Is that table compressed?

konskov avatar Jan 19 '24 07:01 konskov

Sure, here is the DDL + TimescaleDB / compression settings:

create table logs (
    column1  bigint not null,
    column2  int not null,
    column3  int not null,

    column4  int not null,
    column5  int,
    column6  bool not null,

    column7  bytea not null,
    column8  bytea,
    column9  bytea,
    column10 bytea,
    column11 bytea,
    column12 bytea null,
    primary key (column1, column2, column3)
);

select create_hypertable('logs', 'column1', chunk_time_interval => 300000, create_default_indexes => false);

alter table logs set (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'column7',
  timescaledb.compress_orderby = 'column1 desc, column2 desc, column4 desc'
);
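As a possible workaround we could also try decompressing the chunks the backfill touches, running the backfill, and recompressing afterwards; a hedged sketch (the chunk selection cutoff is illustrative):

```
-- Illustrative sketch: decompress affected chunks, backfill, recompress.
SELECT decompress_chunk(c) FROM show_chunks('logs', older_than => 1000000000::bigint) AS c;
-- ... run the COPY / backfill here ...
SELECT compress_chunk(c) FROM show_chunks('logs', older_than => 1000000000::bigint) AS c;
```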

iliastsa avatar Jan 19 '24 11:01 iliastsa

hello @iliastsa, we are trying to reproduce the error by inserting/deleting and COPYing into compressed chunks, but unfortunately we do not have a reproduction case so far. Have you been able to collect a coredump for this segfault? If so, would it be possible to get the stack trace from the coredump with gdb and share it with us? That would be very useful for debugging.

konskov avatar Jan 24 '24 12:01 konskov

Yeah, I've also tried to reproduce it locally with inserts/deletes/COPYs but can't get it to crash. I don't have a coredump; I'll try to get one when we encounter the crash again.

iliastsa avatar Jan 29 '24 08:01 iliastsa

We are now hitting what seems to be the same issue. This system had been running for two months without problems and suddenly started getting this:

2024-05-16 10:43:09.371 BST [1] LOG:  database system is ready to accept connections
2024-05-16 10:43:09.374 BST [796] LOG:  TimescaleDB background worker launcher connected to shared catalogs
2024-05-16 10:43:09.884 BST [804] ERROR:  duplicate key value violates unique constraint "1332_828_tag_history_pkey"
2024-05-16 10:43:09.884 BST [804] DETAIL:  Key (tag_id, "time")=(8e2e78ff-46ff-59eb-a00c-019384ecbf15, 2024-05-16 10:43:04.197+01) already exists.
2024-05-16 10:43:09.884 BST [804] CONTEXT:  COPY tag_history, line 2
2024-05-16 10:43:09.884 BST [804] STATEMENT:  COPY wd.tag_history FROM STDIN
2024-05-16 10:43:15.319 BST [1] LOG:  server process (PID 801) was terminated by signal 11: Segmentation fault
2024-05-16 10:43:15.319 BST [1] DETAIL:  Failed process was running: COPY wd.tag_history FROM STDIN
2024-05-16 10:43:15.319 BST [1] LOG:  terminating any other active server processes
2024-05-16 10:43:15.322 BST [1] LOG:  all server processes terminated; reinitializing
2024-05-16 10:43:15.573 BST [808] LOG:  database system was interrupted; last known up at 2024-05-16 10:43:09 BST
2024-05-16 10:43:15.573 BST [809] FATAL:  the database system is in recovery mode

The duplicate key error is semi-expected; the real issue is that it should not crash Postgres!

CREATE TABLE wd.tag_history (
    time TIMESTAMPTZ NOT NULL,
    tag_id UUID NOT NULL,
    quality INT,
    value_int BIGINT,
    value_bool BOOLEAN,
    value_float DOUBLE PRECISION,
    value_str TEXT,
    PRIMARY KEY(tag_id, time)
);
SELECT create_hypertable('wd.tag_history', 'time',
    if_not_exists => true,
    chunk_time_interval => interval '1 day'
);
ALTER TABLE wd.tag_history SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'tag_id'
);

Backtrace from core dump:

#0  0x000055b34ab09481 in tts_buffer_heap_getsomeattrs ()
#1  0x000055b34ab0b31e in slot_getsomeattrs_int ()
#2  0x00007fd0954a9c70 in slot_getsomeattrs (attnum=1, slot=0x7fd0952daa88) at /usr/local/include/postgresql/server/executor/tuptable.h:344
#3  slot_getattr (isnull=<synthetic pointer>, attnum=1, slot=0x7fd0952daa88) at /usr/local/include/postgresql/server/executor/tuptable.h:387
#4  build_scankeys (num_scankeys=<synthetic pointer>, slot=0x7fd0952daa88, null_columns=0x7ffdde145a38, key_columns=<optimized out>, decompressor=..., hypertable_relid=<optimized out>, hypertable_id=<optimized out>)
    at /build/timescaledb/tsl/src/compression/compression.c:1829
#5  decompress_batches_for_insert (cis=<optimized out>, chunk=<optimized out>, slot=0x7fd0952daa88) at /build/timescaledb/tsl/src/compression/compression.c:1974
#6  0x00007fd095590a0e in ts_chunk_dispatch_get_chunk_insert_state (dispatch=0x7fd1921639d0, point=0x7fd0951fe908, slot=0x7fd0952daa88, on_chunk_changed=on_chunk_changed@entry=0x0, data=data@entry=0x0)
    at /build/timescaledb/src/nodes/chunk_dispatch/chunk_dispatch.c:172
#7  0x00007fd09555a630 in TSCopyMultiInsertBufferFlush (miinfo=miinfo@entry=0x7ffdde145da0, buffer=buffer@entry=0x7fd0952fa088) at /build/timescaledb/src/copy.c:324
#8  0x00007fd09555a978 in TSCopyMultiInsertInfoFlush (miinfo=0x7ffdde145da0, cur_cis=0x7fd0953ce378) at /build/timescaledb/src/copy.c:527
#9  0x00007fd09555b155 in copyfrom (ccstate=ccstate@entry=0x7fd192163968, range_table=<optimized out>, ht=ht@entry=0x7fd095519498, callback=<optimized out>, arg=arg@entry=0x7fd0953ff5e8, copycontext=<optimized out>)
    at /build/timescaledb/src/copy.c:1124
#10 0x00007fd09555bc21 in timescaledb_DoCopy (stmt=<optimized out>, queryString=<optimized out>, processed=<optimized out>, ht=<optimized out>) at /build/timescaledb/src/copy.c:1417
#11 0x00007fd09556ba69 in process_copy (args=0x7ffdde145f70) at /build/timescaledb/src/process_utility.c:673
#12 0x00007fd09556bed1 in process_ddl_command_start (args=0x7ffdde145f70) at /build/timescaledb/src/process_utility.c:4224
#13 timescaledb_ddl_command_start (pstmt=0x7fd19217f4b8, query_string=<optimized out>, readonly_tree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=<optimized out>, dest=<optimized out>, 
    completion_tag=<optimized out>) at /build/timescaledb/src/process_utility.c:4467
#14 0x000055b34ac75cdc in PortalRunUtility ()
#15 0x000055b34ac75dfb in PortalRunMulti ()
#16 0x000055b34ac7631b in PortalRun ()
#17 0x000055b34ac72532 in exec_simple_query ()
#18 0x000055b34ac74107 in PostgresMain ()
#19 0x000055b34abf064a in ServerLoop ()
#20 0x000055b34abf1493 in PostmasterMain ()
#21 0x000055b34a974346 in main ()

Postgres version: 14.9. Timescale version: 2.12.2 (based on docker image timescale/timescaledb:2.12.2-pg14).

SystemParadox avatar May 16 '24 10:05 SystemParadox

OK, it turns out that in our case we were actually still on TimescaleDB 2.11.0. Running ALTER EXTENSION to actually update to 2.12.2 fixed this particular issue for us.

SystemParadox avatar May 16 '24 12:05 SystemParadox

...except that the issue has reappeared after upgrading to 2.14.2.

SystemParadox avatar May 16 '24 13:05 SystemParadox

Is the stack trace on 2.14.2 different? There was a similar problem that was fixed in 2.12; that is probably what you hit initially: https://github.com/timescale/timescaledb/pull/6117 But if it still fails on 2.14.2, maybe you're hitting something different now.

If you can put together a snippet of data on which it always reproduces, that would be perfect, because I was not able to reproduce it or pin down a possible cause after some experiments.
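For what it's worth, a minimal shape for such a reproduction attempt might look like the following (table and values are illustrative, and this is not confirmed to trigger the crash):

```
-- Illustrative sketch: compress a chunk, then COPY a batch containing a key
-- that already exists in the compressed data, mirroring the logs above.
CREATE TABLE repro (
    time TIMESTAMPTZ NOT NULL,
    tag_id UUID NOT NULL,
    value DOUBLE PRECISION,
    PRIMARY KEY (tag_id, time)
);
SELECT create_hypertable('repro', 'time', chunk_time_interval => interval '1 day');
ALTER TABLE repro SET (timescaledb.compress, timescaledb.compress_segmentby = 'tag_id');
INSERT INTO repro VALUES ('2024-05-16 10:43:04+00', '00000000-0000-0000-0000-000000000001', 1.0);
SELECT compress_chunk(c) FROM show_chunks('repro') AS c;
-- Then COPY rows into repro that duplicate the (tag_id, time) key above.
```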

akuzm avatar May 17 '24 11:05 akuzm

Unfortunately this was all on a live system and I didn't manage to obtain a stack trace for 2.14.2.

SystemParadox avatar May 19 '24 16:05 SystemParadox