
Join on the time column of a compressed chunk can fail

Open mickael-choisnet opened this issue 3 years ago • 15 comments

Relevant system information:

  • OS: Debian GNU/Linux 10 (buster)
  • PostgreSQL version (output of postgres --version): postgres (PostgreSQL) 12.6 (Debian 12.6-1.pgdg100+1)
  • TimescaleDB version (output of \dx in psql): 2.1.1
  • Installation method: sudo apt-get install timescaledb-2-postgresql-12

Describe the bug We designed a hypertable 'releves' to store daily meter measurements:

CREATE TABLE IF NOT EXISTS public.releves (
    numero_serie TEXT, 
    constructeur TEXT, 
    date_mesure TIMESTAMP NOT NULL, 
    type_mesure SMALLINT, 
    unite_de_base SMALLINT, 
    valeur_mesure DOUBLE PRECISION, 
    exposant_10 SMALLINT, 
    detail_donnee SMALLINT, 
    entree_mesure SMALLINT,
    type_appareil SMALLINT,
    source SMALLINT,
    type_dispositif SMALLINT
);
SELECT create_hypertable('releves', 'date_mesure', chunk_time_interval => INTERVAL '14 days', if_not_exists => TRUE);
ALTER TABLE releves SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'constructeur, numero_serie'
);
SELECT add_compression_policy('releves', INTERVAL '90 days');

We are using your backfill procedure to insert into the last 12 months of compressed chunks once a day (at 0:00 UTC, when the server can easily absorb the extra load). The process works well; however, we randomly run into an unexpected behavior: after being compressed again, some random chunk will throw an error when queried with a join on the time column of the hypertable. Error encountered: attribute 3 of type _timescaledb_internal.compress_hyper_5_2569_chunk has wrong type Error details: Table has type _timescaledb_internal.compressed_data, but query expects timestamp without time zone.
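In essence the backfill boils down to a decompress / insert / recompress cycle over the affected window. A simplified sketch of the idea (the staging table name is a placeholder, not our exact code):

-- decompress the chunks covering the backfill window
SELECT decompress_chunk(c, if_compressed => true)
FROM show_chunks('releves', newer_than => now() - INTERVAL '12 months') c;

-- insert the late-arriving measures from a staging table (placeholder name)
INSERT INTO releves SELECT * FROM releves_staging;

-- compress the chunks again
SELECT compress_chunk(c, if_not_compressed => true)
FROM show_chunks('releves', newer_than => now() - INTERVAL '12 months') c;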

One way we have found to fix this is to decompress the involved chunk and then compress it again, although sometimes we have to do so more than once before the issue goes away.

The kind of query that will throw an error:

SELECT p.compteur_id, r.date_mesure, r.valeur_mesure, r.unite_de_base
FROM releves r
JOIN patrimoine p
    ON p.constructeur = r.constructeur
    AND p.numero_serie = r.numero_serie
    AND p.date_pose <= r.date_mesure
WHERE r.date_mesure >= '2020-06-01' AND r.date_mesure <= '2021-04-01'
    AND p.local_id = '1234567';

The struggle is with the join clause between 'patrimoine' and 'releves' on the dates (date_pose and date_mesure). If I remove that part and only keep the WHERE clause involving date_mesure, I don't see any error.

To Reproduce I am not sure this behavior can easily be reproduced on any hypertable, but it seems to be linked to the backfill procedure (its duration? a background job running afterwards?). I have tried to reproduce it with the following steps, but the error only occurs temporarily and seems to resolve itself in this particular scenario:

  1. Run SELECT decompress_chunk('_timescaledb_internal.chunk_name') on one of your compressed chunks
  2. Run SELECT compress_chunk('_timescaledb_internal.chunk_name') to compress it again
  3. Run a query with a join on the time column and with a period that should include the time range of the chunk

I got the error for a few seconds right after the compression, but then it disappeared, which is not the case in our production environment, where the backfill is performed during the night and the error is still there in the morning.

Expected behavior Queries across compressed chunks should always work.

Actual behavior After being decompressed and compressed again, some chunks can end up in a state where the time column is no longer recognized as a valid timestamp.

Additional context We first encountered this error under TimescaleDB version 2.0.1 and upgraded to 2.1.1 yesterday afternoon in order to check whether it had been spotted and resolved since 2.0.1.

mickael-choisnet avatar Apr 01 '21 11:04 mickael-choisnet

Hmm looks like the decompression is not working, can you post the EXPLAIN for the failing query?

svenklemm avatar Apr 03 '21 09:04 svenklemm

Hi @svenklemm, I have just run the query with an EXPLAIN command, but it does not return any information other than the error I already gave in my report, as you can see in the screenshot below: [screenshot of the failing EXPLAIN showing the same error]

This error also shows up in the postgres log file for both the SELECT statement and the EXPLAIN statements.

As I said, the query works if I remove the AND p.date_pose <= r.date_mesure part, and in that case I do get an EXPLAIN, but I'm not sure it will help in understanding the error. In any case, the EXPLAIN from the modified query is here: https://explain.depesz.com/s/IX3G

mickael-choisnet avatar Apr 03 '21 19:04 mickael-choisnet

Your screenshot shows you are running EXPLAIN (ANALYZE), which does the explain but also tries to execute the query. Does just running EXPLAIN without ANALYZE and BUFFERS work?

svenklemm avatar Apr 03 '21 19:04 svenklemm

OK, I didn't think about that, but you are right: it does work when I remove the ANALYZE and BUFFERS options from the EXPLAIN command. Please find the result here: https://explain.depesz.com/s/ZYUK

mickael-choisnet avatar Apr 03 '21 20:04 mickael-choisnet

Do you still get the error if you disable bitmapscan?

set enable_bitmapscan to false;

svenklemm avatar Apr 11 '21 00:04 svenklemm

Hi @svenklemm, I've just tried it and the answer is no. I do not get the error when enable_bitmapscan is set to false.

mickael-choisnet avatar Apr 11 '21 09:04 mickael-choisnet

Hi,

Do you have any news regarding this issue? Is it safe for performance to set enable_bitmapscan to off in a production environment? What is this parameter for? How can I know the potential impacts?

mickael-choisnet avatar Sep 03 '21 06:09 mickael-choisnet

I would consider it safe. It will prevent bitmap scans from being used, but bitmap scans are not that common; ideally you only have sequential scans and index scans, though this heavily depends on your schema and the types of queries you are running.
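If you only want to disable it for the problematic query rather than the whole session, you can scope the setting to a single transaction; a sketch using the query from your report:

BEGIN;
SET LOCAL enable_bitmapscan TO off;  -- only affects this transaction
SELECT p.compteur_id, r.date_mesure, r.valeur_mesure, r.unite_de_base
FROM releves r
JOIN patrimoine p
    ON p.constructeur = r.constructeur
    AND p.numero_serie = r.numero_serie
    AND p.date_pose <= r.date_mesure
WHERE r.date_mesure >= '2020-06-01' AND r.date_mesure <= '2021-04-01'
    AND p.local_id = '1234567';
COMMIT;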

svenklemm avatar Sep 05 '21 20:09 svenklemm

I'd say that a program error like this is definitely a bug and probably high priority. Good that we have a workaround, bad that we don't know how to reproduce it.

akuzm avatar Oct 28 '21 11:10 akuzm

Just to warn any other people who might have this problem: setting enable_bitmapscan to off is not a safe workaround. The query will not throw an exception, but it will never finish and will eventually time out (depending on how the client performing the query and the server are configured). We encountered this behavior last week and thought it was this issue. When we set the parameter back to on, the error was there again.

At least with the exception the problem is clearly identified and can be addressed.

We are planning to upgrade the Timescale extension to version 2.4.2 or higher and rework our solution to insert directly into the compressed chunks. That will solve the insert part, but we will still need to decompress the chunks to delete rows. Looking forward to being able to delete and update in compressed chunks as well.
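For the delete part we will most likely keep doing something along these lines (a rough sketch; the serial number and the date window are placeholders):

-- decompress the chunks covering the rows to delete
SELECT decompress_chunk(c, if_compressed => true)
FROM show_chunks('releves',
                 older_than => TIMESTAMP '2021-05-01',
                 newer_than => TIMESTAMP '2021-04-01') c;

-- delete the unwanted rows
DELETE FROM releves
WHERE numero_serie = 'SN-0001'  -- placeholder serial number
  AND date_mesure >= '2021-04-01' AND date_mesure < '2021-05-01';

-- compress the chunks again
SELECT compress_chunk(c, if_not_compressed => true)
FROM show_chunks('releves',
                 older_than => TIMESTAMP '2021-05-01',
                 newer_than => TIMESTAMP '2021-04-01') c;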

mickael-choisnet avatar Oct 31 '21 09:10 mickael-choisnet

@choisnetm We really need more info to reproduce this problem.

I tried to reproduce the issue on master (we are at the 2.5.1 release now). I tried creating a "patrimoine" table (based on the query, since we don't have the table definition), added some data to the "releves" and "patrimoine" tables, and then ran the join query. My query runs without errors and I cannot repro this issue. (I see a BitmapHeapScan in the plan, but the query runs successfully.)

Do you still have this problem on the latest Timescale release? If you do, could you provide a reproducible test case? Please include all CREATE TABLE and index statements plus the compression settings.

gayyappan avatar Jan 20 '22 16:01 gayyappan

I'm on PG13 and Timescale 2.6.0 and seem to have found a potential method for reproducing this issue. For me, the issue only seems to occur when the filters specify a larger time range spanning multiple chunks (or there is no time filter at all) and a join is then performed from the hypertable to a standard table using a compressed column as part of the join condition.

set enable_bitmapscan to false; does fix the issue in my case as well, at least temporarily

With bitmapscan disabled, EXPLAIN shows an index scan of the compressed chunks --> decompress --> append --> join. With bitmapscan enabled, some chunks still do an index scan before decompression, but others do: bitmap index scan --> bitmap heap scan --> decompress.

It seems that the DB is choosing whether or not to do a bitmap scan depending on the number of chunks being queried, but it isn't applying that logic consistently. If the time range filter is decreased, the bitmap scan is removed, even though the DB wanted to use a bitmap scan on the same chunk in a previous, larger query.

The chunks are all ~3-4 GB before compression and ~300 MB after compression. Each chunk contains between 10 and 30 million rows.
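(Chunk sizes like the ones above can be checked with chunk_compression_stats; roughly like this, where 'my_hypertable' stands in for the real hypertable name:)

SELECT chunk_name,
       pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM chunk_compression_stats('my_hypertable')
ORDER BY chunk_name;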

joeyberkovitz avatar Mar 24 '22 15:03 joeyberkovitz

Hey there @joeyberkovitz, it would be helpful for us to make progress on this one if you could provide a script or standalone steps for reproducing this issue.

RafiaSabih avatar Jul 26 '22 13:07 RafiaSabih

@RafiaSabih @gayyappan

This same issue has started to occur for us. After some trial and error, I've managed to come up with a series of SQL statements that can reproduce it. The trick is getting the data into a state that forces PostgreSQL to use a bitmap scan rather than an index or sequential scan: with too much data it uses a sequential scan, and with too little it uses an index scan.

The following was run on a timescale/timescaledb:2.5.0-pg12 Docker image.

CREATE TABLE IF NOT EXISTS metrics (
  "time" timestamp WITH time zone NOT NULL,
  device_id int NOT NULL,
  reading float NOT NULL
);

SELECT create_hypertable('metrics', 'time', chunk_time_interval => INTERVAL '1 day');

INSERT INTO
  metrics
SELECT 
  time, device_id, random()*100 as cpu_usage 
FROM 
  generate_series(
    now() - INTERVAL '20 days',
    now(),
    INTERVAL '1 second'
  ) as time, 
  generate_series(1,10) device_id
;

ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby = 'device_id');

CREATE TABLE metrics_temp AS SELECT * FROM metrics WHERE time <= now() - interval '10 days' ORDER BY time DESC LIMIT 1;

SELECT compress_chunk(c, if_not_compressed => true) FROM show_chunks('metrics') c;

--- this query triggers the error
SELECT
  metrics_temp.*
FROM
  metrics_temp
INNER JOIN
  metrics
ON
  metrics_temp.time = metrics.time AND
  metrics_temp.device_id = metrics.device_id
WHERE
  metrics.device_id < 100 OR metrics.device_id > 500 --- including this is what helps trigger the bitmap scan
LIMIT 1
;

The output of the above:

ERROR:  attribute 1 of type _timescaledb_internal.compress_hyper_2_22_chunk has wrong type
DETAIL:  Table has type _timescaledb_internal.compressed_data, but query expects timestamp with time zone.

Note: I also reproduced this on the timescale/timescaledb:latest-pg12 Docker image. Same thing happens.
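To check which scan the planner picks without executing the query, a plain EXPLAIN of the same statement should be enough; in the failing case it should show a Bitmap Heap Scan on the compress_hyper_* chunk below the DecompressChunk node:

EXPLAIN (COSTS OFF)
SELECT metrics_temp.*
FROM metrics_temp
INNER JOIN metrics
  ON metrics_temp.time = metrics.time
 AND metrics_temp.device_id = metrics.device_id
WHERE metrics.device_id < 100 OR metrics.device_id > 500
LIMIT 1;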

dustinsorensen avatar Aug 25 '22 15:08 dustinsorensen

Hello @dustinsorensen,

Thank you for providing the steps needed to reproduce the issue. I had to modify them slightly to be able to reproduce the issue reliably. However, I could also reproduce the issue with the current development version (2.9.0-dev) of TimescaleDB on PostgreSQL 14.

test2=# SELECT * FROM version();
                                              version                                              
---------------------------------------------------------------------------------------------------
 PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
(1 row)

test2=# \dx
                                       List of installed extensions
    Name     |  Version  |   Schema   |                            Description                            
-------------+-----------+------------+-------------------------------------------------------------------
 plpgsql     | 1.0       | pg_catalog | PL/pgSQL procedural language
 timescaledb | 2.9.0-dev | public     | Enables scalable inserts and complex queries for time-series data
(2 rows)

CREATE TABLE IF NOT EXISTS metrics (
  "time" timestamp WITH time zone NOT NULL,
  device_id int NOT NULL,
  reading float NOT NULL
);

SELECT create_hypertable('metrics', 'time', chunk_time_interval => INTERVAL '1 day');

INSERT INTO
  metrics
SELECT 
  time, device_id, random()*100 as cpu_usage 
FROM 
  generate_series(
    now() - INTERVAL '20 days',
    now(),
    INTERVAL '1 second'
  ) as time, 
  generate_series(1,10) device_id;

ALTER TABLE metrics SET (timescaledb.compress, timescaledb.compress_segmentby = 'device_id');

CREATE TABLE metrics_temp AS SELECT * FROM metrics WHERE time <= now() - interval '10 days' ORDER BY time DESC LIMIT 1;

SELECT compress_chunk(c, if_not_compressed => true) FROM show_chunks('metrics') c;

SELECT
  metrics_temp.*
FROM
  metrics_temp
INNER JOIN
  metrics
ON
  metrics_temp.time = metrics.time AND
  metrics_temp.device_id = metrics.device_id
LIMIT 1;

             time              | device_id |      reading      
-------------------------------+-----------+-------------------
 2022-09-02 13:57:45.508386+02 |         1 | 61.97267052259434
(1 row)


SET enable_bitmapscan to on;
SET enable_indexscan to off;

SELECT
  metrics_temp.*
FROM
  metrics_temp
INNER JOIN
  metrics
ON
  metrics_temp.time = metrics.time AND
  metrics_temp.device_id = metrics.device_id
LIMIT 1;

ERROR:  attribute 1 of type _timescaledb_internal.compress_hyper_16_7438_chunk has wrong type
DETAIL:  Table has type _timescaledb_internal.compressed_data, but query expects timestamp with time zone.

jnidzwetzki avatar Sep 12 '22 13:09 jnidzwetzki

Still seems to be a problem for me on pg14 & ts 2.8.1. The difference for me is that I have a view on the compressed table, and I'm selecting from the view and doing a full join on a non-compressed table.

set local enable_bitmapscan to false; seems to work around the problem. ping @svenklemm

xvaara avatar Oct 26 '22 15:10 xvaara

Do you have a self-contained test case to reproduce this?

svenklemm avatar Oct 26 '22 17:10 svenklemm

@svenklemm it seems it goes away if I vacuum full the table... also SET enable_indexscan to false; made it go away.

It's some weird side effect that I can't reproduce, but I can give you a copy of the database; I've removed most of the non-relevant data from it. I'll hit you up on Slack.

xvaara avatar Oct 28 '22 12:10 xvaara