
Duplicate rows returned when including certain time points in call to time_bucket_gapfill

Open genedavis opened this issue 2 years ago • 15 comments

Relevant system information:

  • OS: Ubuntu 20.04.3 LTS
  • PostgreSQL version (output of postgres --version): 13.4
  • TimescaleDB version (output of \dx in psql): timescaledb 2.4.2, timescaledb_toolkit 1.1
  • Installation method: apt install

Describe the bug
Certain calls to time_bucket_gapfill return duplicate results. With a slight change in the time range the problem goes away, but (in this case) it seems to always occur when the time range overlaps '2020-11-14 12:10:00':

SELECT tag_name, time_bucket_gapfill('1 minute', resample_time, '2020-11-14 12:08:00', '2020-11-14 12:10:00') resample_time
FROM
(
        SELECT m.tag_name, time_bucket('1 minute', tag_time) AS resample_time
        FROM tag_data m
        WHERE tag_time BETWEEN '2020-11-14 12:08:00' AND '2020-11-14 12:10:00'
              AND m.tag_name = 'XXXX12345'
        GROUP BY 1, 2
) mm
GROUP BY 2, 1 ORDER BY 2, 1;

returns:

tag_name    resample_time
---------   -------------------
XXXX12345   2020-11-14 12:08:00
XXXX12345   2020-11-14 12:08:00
XXXX12345   2020-11-14 12:09:00
XXXX12345   2020-11-14 12:09:00

(note that the tag name is anonymized)

To Reproduce
Steps to reproduce the behavior: when I attempt to reproduce this with a smaller data set or a simpler case, I am unable to. For a subset of the data copied to a new table ("tag_table_test"), running the same query returned the correct results. A rough sketch of that attempt is below.
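
A sketch of the subset-copy attempt, assuming a simplified schema (the real tag_data column list is not shown in this issue, so tag_value and the segmentby setting below are assumptions):

-- Sketch only: the real tag_data schema is not shown here, so the column list
-- (tag_name, tag_time, tag_value) is an assumption.
CREATE TABLE tag_table_test (
    tag_name  text             NOT NULL,
    tag_time  timestamp        NOT NULL,
    tag_value double precision
);
SELECT create_hypertable('tag_table_test', 'tag_time');

-- Copy a subset of the original data around the problematic time range.
INSERT INTO tag_table_test
SELECT tag_name, tag_time, tag_value
  FROM tag_data
 WHERE tag_time BETWEEN '2020-11-14 00:00:00' AND '2020-11-15 00:00:00'
   AND tag_name = 'XXXX12345';

-- Compress the copy; segmenting by tag_name is an assumption based on the
-- compressed-chunk index visible in the plans below.
ALTER TABLE tag_table_test SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'tag_name'
);
SELECT compress_chunk(c) FROM show_chunks('tag_table_test') c;

-- Re-run the failing query against the copy.
SELECT tag_name,
       time_bucket_gapfill('1 minute', tag_time,
                           '2020-11-14 12:08:00', '2020-11-14 12:10:00') AS resample_time
  FROM tag_table_test
 WHERE tag_time BETWEEN '2020-11-14 12:08:00' AND '2020-11-14 12:10:00'
   AND tag_name = 'XXXX12345'
 GROUP BY 2, 1
 ORDER BY 2, 1;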

Expected behavior
I expected not to see duplicate rows across tag_name and resample_time. Note that the inner subquery always returns the right number of rows, whether or not the full query produces duplicates.

Actual behavior
Duplicate rows are returned (see above). The explain plans of the query variants that work and those that do not look identical except for the time range. Here is a query plan that generates duplicates (tag_name again anonymized in this writeup):

                                                                                                QUERY PLAN                                                                                                    
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sort  (cost=67.92..68.92 rows=200 width=524)
  Sort Key: (time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone)), mm.tag_name
  ->  Custom Scan (GapFill)  (cost=59.78..60.28 rows=200 width=0)
        ->  Sort  (cost=59.78..60.28 rows=200 width=0)
              Sort Key: mm.tag_name, (time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone))
              ->  HashAggregate  (cost=49.63..52.13 rows=200 width=0)
                    Group Key: time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone), mm.tag_name
                    ->  Subquery Scan on mm  (cost=40.66..48.14 rows=299 width=524)
                          ->  HashAggregate  (cost=40.66..44.40 rows=299 width=524)
                                Group Key: m.tag_name, time_bucket('00:01:00'::interval, m.tag_time)
                                ->  Result  (cost=2.83..30.66 rows=2000 width=524)
                                      ->  Custom Scan (DecompressChunk) on _hyper_54_5832_chunk m  (cost=2.83..5.66 rows=2000 width=524)
                                            Filter: ((tag_time >= '2020-11-14 12:08:00'::timestamp without time zone) AND (tag_time <= '2020-11-14 12:10:00'::timestamp without time zone))
                                            ->  Index Scan using compress_hyper_58_5947_chunk__compressed_hypertable_58_tag_name on compress_hyper_58_5947_chunk  (cost=0.42..5.66 rows=2 width=72)
                                                  Index Cond: ((tag_name)::text = 'XXXX12345'::text)
                                                  Filter: ((_ts_meta_max_1 >= '2020-11-14 12:08:00'::timestamp without time zone) AND (_ts_meta_min_1 <= '2020-11-14 12:10:00'::timestamp without time zone))


Additional context
The tag_data table is a hypertable and is compressed.
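
For context, the setup was roughly along these lines. The compression settings and policy schedule below are assumptions; only the fact that tag_data is a compressed hypertable is stated here:

-- Assumed setup; the exact options used are not shown in this issue.
SELECT create_hypertable('tag_data', 'tag_time');

ALTER TABLE tag_data SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'tag_name',  -- assumption, suggested by the compressed-chunk index in the plans
    timescaledb.compress_orderby   = 'tag_time'   -- assumption
);

-- Compression policy with a made-up schedule; the real interval is not known.
SELECT add_compression_policy('tag_data', INTERVAL '7 days');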

genedavis avatar Nov 09 '21 21:11 genedavis

@genedavis Thank you for the bug report. Running a simple example does not allow me to reproduce it. Do you have the table definition and some sample data that can reproduce the bug?

mkindahl avatar Nov 10 '21 08:11 mkindahl

Thank you @mkindahl - I haven't yet found a simple reproducing case; I'll keep working on that. In the meantime I am considering possible workarounds and am open to suggestions: regenerating the table, for example, or alternate queries.

genedavis avatar Nov 10 '21 14:11 genedavis

One update on the above: I decompressed the chunk identified in the explain plan, and this time I got the correct results. I will now recompress the chunk to see whether the problem is still reproducible. Here is the (uncompressed) explain plan for reference:

                                                                                                                       QUERY PLAN                                                                                                                        
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=3466.64..3467.64 rows=200 width=524)
   Sort Key: (time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone)), mm.tag_name
   ->  Custom Scan (GapFill)  (cost=3458.49..3458.99 rows=200 width=0)
         ->  Sort  (cost=3458.49..3458.99 rows=200 width=0)
               Sort Key: mm.tag_name, (time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone))
               ->  Group  (cost=3448.11..3450.85 rows=200 width=0)
                     Group Key: (time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone)), mm.tag_name
                     ->  Sort  (cost=3448.11..3448.85 rows=299 width=524)
                           Sort Key: (time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone)), mm.tag_name
                           ->  Subquery Scan on mm  (cost=0.57..3435.81 rows=299 width=524)
                                 ->  Group  (cost=0.57..3432.08 rows=299 width=524)
                                       Group Key: m.tag_name, (time_bucket('00:01:00'::interval, m.tag_time))
                                       ->  Result  (cost=0.57..3416.36 rows=2994 width=524)
                                             ->  Index Only Scan using "5832_1744_tag_data_pkey" on _hyper_54_5832_chunk m  (cost=0.57..3378.93 rows=2994 width=524)
                                                   Index Cond: ((tag_name = 'XXXX12345'::text) AND (tag_time >= '2020-11-14 12:08:00'::timestamp without time zone) AND (tag_time <= '2020-11-14 12:10:00'::timestamp without time zone))

genedavis avatar Nov 10 '21 15:11 genedavis

Interestingly, recompressing the chunk brought the problem back, which suggests there may be steps I can follow to reproduce it. Side question: are there any tools for querying the chunk tables that would help diagnose what is going on?
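
For reference, this is the decompress/recompress round trip described above, plus the documented ways to inspect chunk state (timescaledb_information.chunks and chunk_compression_stats); these show chunk layout and compression status rather than the compressed batch contents:

-- Decompress the chunk named in the plan, re-run the query
-- (it returned correct results while decompressed), then recompress.
SELECT decompress_chunk('_timescaledb_internal._hyper_54_5832_chunk');
-- ... re-run the gapfill query here ...
SELECT compress_chunk('_timescaledb_internal._hyper_54_5832_chunk');

-- Documented views/functions for inspecting chunks (not the internal tables):
SELECT chunk_name, range_start, range_end, is_compressed
  FROM timescaledb_information.chunks
 WHERE hypertable_name = 'tag_data';

SELECT chunk_name, compression_status,
       before_compression_total_bytes, after_compression_total_bytes
  FROM chunk_compression_stats('tag_data');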

genedavis avatar Nov 10 '21 15:11 genedavis

OK - with a small amount of identical data, even after converting it to a hypertable and compressing, I can't reproduce the problem. This leads me to think it needs more data, and a more complex plan, to trigger the scenario. I need to focus on the workaround for a bit (decompressing the critical data), but I will see if I can build a case big enough to reproduce without needing all 18B rows I have. If you have other suggestions or diagnostics I can run, let me know.

genedavis avatar Nov 11 '21 01:11 genedavis

@genedavis It might be related to the insertion support for compressed tables: inserted rows are first added to a separate chunk and then merged into the compressed chunk. Did you insert any rows into the compressed chunk? The decompress and recompress you did makes things more complicated; after that, you should have a single compressed chunk.

mkindahl avatar Nov 11 '21 08:11 mkindahl

No rows were inserted into a compressed chunk. Here's what we did:

  1. create table
  2. define table as hypertable
  3. insert about 18B rows (edit: it's only 6B rows)
  4. set up compression rules
  5. run queries

The data has been static up to now.

genedavis avatar Nov 11 '21 14:11 genedavis

@genedavis Could you also post the plan which uses the compressed chunk (and returns wrong results)?

gayyappan avatar Nov 17 '21 18:11 gayyappan

@gayyappan that was the first explain above. I was also asked to run an EXPLAIN (ANALYZE), so here it is:

tsdb=> EXPLAIN (ANALYZE) SELECT tag_name, time_bucket_gapfill('1 minute', resample_time, '2020-11-14 12:08:00', '2020-11-14 12:10:00') resample_time
        FROM
        (
                SELECT m.tag_name, time_bucket( '1 minute', tag_time) as resample_time
                FROM tag_data m
                WHERE (tag_time BETWEEN '2020-11-14 12:08:49' AND '2020-11-14 12:10:00') -- lower bound
                AND m.tag_name = 'XXXX12345' 
                GROUP BY 1,2
        )mm
        GROUP BY 2,1 order by 2,1;
                                                                                                                  QUERY PLAN                                                                                                                   
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=67.92..68.92 rows=200 width=524) (actual time=0.226..0.228 rows=4 loops=1)
   Sort Key: (time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone)), mm.tag_name
   Sort Method: quicksort  Memory: 25kB
   ->  Custom Scan (GapFill)  (cost=59.78..60.28 rows=200 width=0) (actual time=0.209..0.213 rows=4 loops=1)
         ->  Sort  (cost=59.78..60.28 rows=200 width=0) (actual time=0.208..0.209 rows=2 loops=1)
               Sort Key: mm.tag_name, (time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone))
               Sort Method: quicksort  Memory: 25kB
               ->  HashAggregate  (cost=49.63..52.13 rows=200 width=0) (actual time=0.197..0.199 rows=2 loops=1)
                     Group Key: time_bucket_gapfill('00:01:00'::interval, mm.resample_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone), mm.tag_name
                     Batches: 1  Memory Usage: 40kB
                     ->  Subquery Scan on mm  (cost=40.66..48.14 rows=299 width=524) (actual time=0.188..0.191 rows=2 loops=1)
                           ->  HashAggregate  (cost=40.66..44.40 rows=299 width=524) (actual time=0.187..0.190 rows=2 loops=1)
                                 Group Key: m.tag_name, time_bucket('00:01:00'::interval, m.tag_time)
                                 Batches: 1  Memory Usage: 37kB
                                 ->  Result  (cost=2.83..30.66 rows=2000 width=524) (actual time=0.086..0.136 rows=7 loops=1)
                                       ->  Custom Scan (DecompressChunk) on _hyper_54_5832_chunk m  (cost=2.83..5.66 rows=2000 width=524) (actual time=0.085..0.134 rows=7 loops=1)
                                             Filter: ((tag_time >= '2020-11-14 12:08:49'::timestamp without time zone) AND (tag_time <= '2020-11-14 12:10:00'::timestamp without time zone))
                                             Rows Removed by Filter: 1993
                                             ->  Index Scan using compress_hyper_58_5947_chunk__compressed_hypertable_58_tag_name on compress_hyper_58_5947_chunk  (cost=0.42..5.66 rows=2 width=72) (actual time=0.020..0.022 rows=2 loops=1)
                                                   Index Cond: ((tag_name)::text = 'XXXX12345'::text)
                                                   Filter: ((_ts_meta_max_1 >= '2020-11-14 12:08:49'::timestamp without time zone) AND (_ts_meta_min_1 <= '2020-11-14 12:10:00'::timestamp without time zone))
                                                   Rows Removed by Filter: 7

genedavis avatar Nov 17 '21 23:11 genedavis

@svenklemm @gayyappan is there anything else I can do to help debug? Is there a way to hint a slightly different plan (avoiding the index, for example)? Thx
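
For example, I could imagine session-local experiments with the standard PostgreSQL planner settings to push the planner away from the index scan or the hash aggregate - though I don't know whether a different plan shape would actually avoid the duplicates:

BEGIN;
SET LOCAL enable_indexscan = off;  -- discourage the index scan on the compressed chunk
SET LOCAL enable_hashagg   = off;  -- prefer sorted (Group) aggregation over HashAggregate
EXPLAIN (ANALYZE)
SELECT tag_name,
       time_bucket_gapfill('1 minute', tag_time,
                           '2020-11-14 12:08:00', '2020-11-14 12:10:00') AS resample_time
  FROM tag_data m
 WHERE tag_time BETWEEN '2020-11-14 12:08:49' AND '2020-11-14 12:10:00'
   AND m.tag_name = 'XXXX12345'
 GROUP BY 2, 1
 ORDER BY 2, 1;
ROLLBACK;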

genedavis avatar Nov 19 '21 16:11 genedavis

Why not use the following query:

EXPLAIN (ANALYZE) SELECT tag_name, time_bucket_gapfill('1 minute', tag_time, '2020-11-14 12:08:00', '2020-11-14 12:10:00') resample_time
 FROM tag_data m
WHERE (tag_time BETWEEN '2020-11-14 12:08:49' AND '2020-11-14 12:10:00') -- lower bound
  AND m.tag_name = 'XXXX12345' 
GROUP BY 2,1 order by 2,1;

svenklemm avatar Nov 19 '21 16:11 svenklemm

Thanks @svenklemm - that version ALSO gives duplicate records (at least against the compressed chunk); the uncompressed case does not. There are reasons we use the nested processing (there might be simpler versions - no guarantee this one would work for us, but maybe), but this seems like a simpler version for "steps to reproduce"?

Here is the EXPLAIN (ANALYZE) for your version of that query:

tsdb=> explain (analyze) SELECT tag_name, time_bucket_gapfill('1 minute', tag_time, '2020-11-14 12:08:00', '2020-11-14 12:10:00') resample_time
 FROM tag_data m
WHERE (tag_time BETWEEN '2020-11-14 12:08:49' AND '2020-11-14 12:10:00') -- lower bound
  AND m.tag_name = 'XXXX12345' 
GROUP BY 2,1 order by 2,1;
                                                                                                         QUERY PLAN                                                                                                          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Custom Scan (GapFill)  (cost=56.69..58.19 rows=299 width=524) (actual time=0.226..0.230 rows=4 loops=1)
   ->  Sort  (cost=56.69..57.44 rows=299 width=0) (actual time=0.225..0.225 rows=2 loops=1)
         Sort Key: (time_bucket_gapfill('00:01:00'::interval, m.tag_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone))
         Sort Method: quicksort  Memory: 25kB
         ->  HashAggregate  (cost=40.66..44.40 rows=299 width=0) (actual time=0.216..0.219 rows=2 loops=1)
               Group Key: time_bucket_gapfill('00:01:00'::interval, m.tag_time, '2020-11-14 12:08:00'::timestamp without time zone, '2020-11-14 12:10:00'::timestamp without time zone), m.tag_name
               Batches: 1  Memory Usage: 37kB
               ->  Result  (cost=2.83..30.66 rows=2000 width=524) (actual time=0.098..0.157 rows=7 loops=1)
                     ->  Custom Scan (DecompressChunk) on _hyper_54_5832_chunk m  (cost=2.83..5.66 rows=2000 width=524) (actual time=0.097..0.154 rows=7 loops=1)
                           Filter: ((tag_time >= '2020-11-14 12:08:49'::timestamp without time zone) AND (tag_time <= '2020-11-14 12:10:00'::timestamp without time zone))
                           Rows Removed by Filter: 1993
                           ->  Index Scan using compress_hyper_58_5947_chunk__compressed_hypertable_58_tag_name on compress_hyper_58_5947_chunk  (cost=0.42..5.66 rows=2 width=72) (actual time=0.023..0.025 rows=2 loops=1)
                                 Index Cond: ((tag_name)::text = 'XXXX12345'::text)
                                 Filter: ((_ts_meta_max_1 >= '2020-11-14 12:08:49'::timestamp without time zone) AND (_ts_meta_min_1 <= '2020-11-14 12:10:00'::timestamp without time zone))
                                 Rows Removed by Filter: 7

genedavis avatar Nov 21 '21 01:11 genedavis

@genedavis Thanks for reporting this. Could you also give a standalone script or steps to reproduce this issue so that we can look into it further?

RafiaSabih avatar Jul 26 '22 16:07 RafiaSabih

This might be fixed by #4526, which is included in 2.7.2. Do you still see the problem with 2.7.2?
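
For reference, checking the loaded version and updating an already-installed extension (after upgrading the packages) looks roughly like this:

-- Check the version currently installed in this database:
SELECT extname, extversion FROM pg_extension WHERE extname = 'timescaledb';

-- After upgrading the packages, update the extension objects in the database.
-- Run this as the first command in a fresh session (e.g. psql -X).
ALTER EXTENSION timescaledb UPDATE;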

svenklemm avatar Aug 14 '22 12:08 svenklemm

@svenklemm this looks promising. We are in the throes of an initial release; I will let you know once we can test.

genedavis avatar Aug 29 '22 18:08 genedavis

Dear Author,

This issue has been automatically marked as stale due to lack of activity. With only the issue description that is currently provided, we do not have enough information to take action. If you have or find the answers we would need, please reach out. Otherwise, this issue will be closed in 30 days. Thank you!

github-actions[bot] avatar Oct 29 '22 02:10 github-actions[bot]

Dear Author,

We are closing this issue due to lack of activity. Feel free to add a comment to this issue if you can provide more information and we will re-open it. Thank you!

github-actions[bot] avatar Nov 29 '22 02:11 github-actions[bot]

I have this issue with PostgreSQL 15.5 and timescaledb 2.13.1. I'm using time_bucket_gapfill('1 minute', ...) and I happen to have some data points that fall exactly on the minute; those entries get duplicated.
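
A minimal sketch of the data shape I mean (schema, timestamps, and values are made up; whether this small case actually reproduces the duplication probably depends on compression and chunk layout, as in the original report):

-- Toy table with one point exactly on a minute boundary.
CREATE TABLE gapfill_repro (ts timestamptz NOT NULL, val double precision);
SELECT create_hypertable('gapfill_repro', 'ts');

INSERT INTO gapfill_repro VALUES
    ('2024-01-15 12:08:30+00', 1.0),
    ('2024-01-15 12:09:00+00', 2.0),  -- exactly on the minute
    ('2024-01-15 12:09:45+00', 3.0);

ALTER TABLE gapfill_repro SET (timescaledb.compress);
SELECT compress_chunk(c) FROM show_chunks('gapfill_repro') c;

-- Each bucket should appear exactly once.
SELECT time_bucket_gapfill('1 minute', ts,
                           '2024-01-15 12:08:00+00', '2024-01-15 12:10:00+00') AS bucket,
       avg(val)
  FROM gapfill_repro
 WHERE ts BETWEEN '2024-01-15 12:08:00+00' AND '2024-01-15 12:10:00+00'
 GROUP BY 1
 ORDER BY 1;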

jflambert avatar Jan 15 '24 18:01 jflambert