
[Bug]: Refresh of continuous aggregate doesn't write any data on equal "start_offset" and "schedule_interval"

Open lukaskirner opened this issue 2 years ago • 4 comments

What type of bug is this?

Incorrect result

What subsystems and features are affected?

Continuous aggregate

What happened?

Setting the start_offset and the schedule_interval of add_continuous_aggregate_policy to the same interval causes no data to be written to the continuous aggregate. Lowering the schedule_interval fixes this, but we require the two intervals to be equal.

TimescaleDB version affected

2.13.0

PostgreSQL version used

15.5

What operating system did you use?

macOS

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted, Timescale Cloud

Relevant log output and stack trace

No error in log.

...
LOG:  refreshing continuous aggregate "agg_sensor_1min" in window [ 2024-01-23 17:02:00+00, 294247-01-02 00:00:00+00 ]
...

How can we reproduce the bug?

-- Create table
CREATE TABLE sensor (
	timestamp TIMESTAMPTZ NOT NULL,
	id TEXT NOT NULL,
	value INT NOT NULL,
	
	CONSTRAINT uk_test_timestamp_device_id UNIQUE (timestamp, id)
);

-- Create hypertable
SELECT create_hypertable('sensor', 'timestamp');

-- Create aggregate (drop any previous version first)
DROP MATERIALIZED VIEW IF EXISTS agg_sensor_1min;
CREATE MATERIALIZED VIEW IF NOT EXISTS agg_sensor_1min WITH (timescaledb.continuous) AS 
    SELECT 
        time_bucket('1 minute'::interval, timestamp) AS bucket,
        id,
        AVG(value) AS avg_value
    FROM sensor
    GROUP BY bucket, id
WITH DATA;

-- Add refresh policy
SELECT add_continuous_aggregate_policy('agg_sensor_1min',
    start_offset => INTERVAL '1 minute',
    end_offset => NULL,
    initial_start => DATE_TRUNC('minute', CURRENT_TIMESTAMP),
    schedule_interval => INTERVAL '1 minute');

-- Check data. Should be empty
SELECT * FROM agg_sensor_1min;

-- Insert data. As many as you like
INSERT INTO sensor (timestamp, id, value) VALUES (CURRENT_TIMESTAMP, 'sensor0', 1);

-- Check data after one minute is over (refresh job was run) --> Still empty
SELECT * FROM agg_sensor_1min;

-- Check if Job was run successfully
SELECT * FROM timescaledb_information.job_stats WHERE job_id >= 1000;

lukaskirner avatar Jan 23 '24 16:01 lukaskirner

hi @lukaskirner, thanks for reaching out. Have you tried setting start_offset to '2 min' instead of 1? I believe that should work for you. With your current settings, the policy only considers data from the minute in which it runs and later; so if you insert a value at e.g. 09:51:30 am, the next run of the policy will be at 09:52:00 and will consider data from 09:52 onwards, ignoring the value at 09:51:30. So I don't think this is a bug. Can you let us know if changing the start_offset works for you?

konskov avatar Jan 24 '24 08:01 konskov

hi @konskov, setting the start_offset to a higher value makes everything work fine. Setting the schedule_interval to less than 1 min also produces values. But from my understanding, start_offset => INTERVAL '1 minute' should consider data from job_started_at - start_offset, not the future as you suggest. So in your example, when the job runs at 09:52:00 the refresh window should be 09:51:00 to 09:52:00.

But inspecting the logs confirms the refresh window behaves as you suggest. When the job starts at 09:52:00, the refresh window in the log reaches into the future: [ 2024-01-24 09:52:00+00, 294247-01-02 00:00:00+00 ]. With start_offset => INTERVAL '2 minutes', as you suggest, the log indicates that the window contains the last minute: [ 2024-01-24 09:51:00+00, 294247-01-02 00:00:00+00 ].
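Both logged window starts can be reproduced with a small model (my assumption from the observed behavior, not TimescaleDB source): subtract start_offset from the job's actual start time, which lands a few milliseconds after the minute boundary, then round up to the next complete bucket.

```python
import math
from datetime import datetime, timedelta, timezone

BUCKET_SECONDS = 60  # width of the cagg's time_bucket ('1 minute')

def logged_window_start(job_start: datetime, start_offset: timedelta) -> datetime:
    # Hypothetical model of the window start seen in the logs: the raw start
    # (job_start - start_offset) is rounded UP to the next complete bucket.
    raw = (job_start - start_offset).timestamp()
    aligned = math.ceil(raw / BUCKET_SECONDS) * BUCKET_SECONDS
    return datetime.fromtimestamp(aligned, tz=timezone.utc)

# The scheduler fires a few milliseconds after 09:52:00.
job = datetime(2024, 1, 24, 9, 52, 0, 3000, tzinfo=timezone.utc)

print(logged_window_start(job, timedelta(minutes=1)))  # -> 2024-01-24 09:52:00+00:00
print(logged_window_start(job, timedelta(minutes=2)))  # -> 2024-01-24 09:51:00+00:00
```

Under this model, a 1-minute offset makes the window start at the current, still-incomplete minute, so the 09:51:30 insert is never materialized, while a 2-minute offset reaches back to the previous complete bucket.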

This raises the question: why is the interval not subtracted directly from the policy execution time, as the API reference (and this link) suggest:

start_offset (INTERVAL or integer): Start of the refresh window as an interval relative to the time when the policy is executed. NULL is equivalent to MIN(timestamp) of the hypertable.

lukaskirner avatar Jan 24 '24 08:01 lukaskirner

If we use a refresh interval of 15 seconds and a start offset of 1 minute 15 seconds, we can verify that the aggregates are correct. To me, this suggests a bug in how the actual start time is calculated: it seems the interval is subtracted from the next start time instead of the start time of the current refresh interval.

That is, for a start offset of 1 minute and a refresh interval of 15 seconds, I would expect the following:

execution  time of execution  refresh start  refresh end
1          10:00:00           09:59:00       (end of time)
2          10:00:15           09:59:00       (end of time)
3          10:00:30           09:59:00       (end of time)
4          10:00:45           09:59:00       (end of time)
5          10:01:00           10:00:00       (end of time)
6          10:01:15           10:00:00       (end of time)

But what we actually see is this:

execution  time of execution  refresh start  refresh end
1          10:00:00           10:00:00       (end of time)
2          10:00:15           10:00:00       (end of time)
3          10:00:30           10:00:00       (end of time)
4          10:00:45           10:00:00       (end of time)
5          10:01:00           10:01:00       (end of time)
6          10:01:15           10:01:00       (end of time)

This way, incoming sensor data in the last 15 seconds of a minute is never accounted for.

To fix this, we need to set a start offset of 1 minute 15 seconds, which, incidentally, is exactly the interval between the time of the following refresh and the refresh window start observed in the logs.

From my understanding, this behavior directly contradicts the documentation.
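The two tables above can be reproduced with a small model (my assumption about the behavior, not TimescaleDB source): the "expected" start floors exec_time - start_offset to the bucket, while the "actual" start takes a slightly-late execution time minus the offset and rounds it up to the next complete bucket.

```python
import math
from datetime import datetime, timedelta, timezone

BUCKET = 60      # cagg bucket width: 1 minute
OFFSET = 60      # start_offset: 1 minute
EPSILON = 0.003  # assumption: the job fires a few milliseconds late

def expected_start(exec_time: datetime) -> datetime:
    # Documented reading: exec_time - start_offset, floored to the bucket.
    raw = exec_time.timestamp() - OFFSET
    return datetime.fromtimestamp(math.floor(raw / BUCKET) * BUCKET, tz=timezone.utc)

def observed_start(exec_time: datetime) -> datetime:
    # Observed behavior: the slightly-late execution time minus the offset,
    # rounded UP to the next complete bucket.
    raw = exec_time.timestamp() + EPSILON - OFFSET
    return datetime.fromtimestamp(math.ceil(raw / BUCKET) * BUCKET, tz=timezone.utc)

base = datetime(2024, 1, 24, 10, 0, 0, tzinfo=timezone.utc)
for i in range(6):
    t = base + timedelta(seconds=15 * i)
    print(t.time(), expected_start(t).time(), observed_start(t).time())
```

Running this reproduces both columns row for row, including the jump of the actual refresh start from 10:00:00 to 10:01:00 at the 10:01:00 execution.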

dtext avatar Jan 24 '24 09:01 dtext

Note to self: the reason for this behavior seems to be ts_compute_inscribed_bucketed_refresh_window_variable. Because the policy runs a few milliseconds after the minute starts, we compute the next bucket (inscribed) as the actual refresh start.
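For illustration, here is a sketch of what an "inscribed" window computation might look like (a hypothetical Python model based on the function name, not the actual C source): both endpoints are moved inward so the window covers only complete buckets, which is why a start time even a few milliseconds past a bucket boundary jumps to the next bucket.

```python
import math

BUCKET = 60  # bucket width in seconds (the cagg's 1-minute time_bucket)

def inscribed_window(start: float, end: float) -> tuple:
    # Shrink the raw refresh window inward so it covers only complete
    # buckets: the start rounds up, the end rounds down.
    return (math.ceil(start / BUCKET) * BUCKET,
            math.floor(end / BUCKET) * BUCKET)

# A job scheduled for t=120 fires at t=120.003; with start_offset = 60 s the
# raw window starts at 60.003, and inscribing rounds it up to 120 -- the
# complete bucket [60, 120) is skipped.
print(inscribed_window(120.003 - 60, 250.0))  # -> (120, 240)
print(inscribed_window(120.0 - 60, 250.0))    # -> (60, 240)
```

The contrast between the two calls shows the effect: an exactly-on-boundary start keeps its bucket, while any delay at all pushes the inscribed start one full bucket later.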

konskov avatar Jan 24 '24 13:01 konskov