timescaledb-extras

Draft: :sparkles: add a first working draft of auto_aggregate selection

Open Xasin opened this issue 1 year ago • 15 comments

This PR adds one new file containing functions that perform automatic downsampling of data for TimescaleDB hypertables and continuous aggregates.

The functionality of the downsampling method is documented in the utils/aggregate_selector.sql file. There are no dependencies on other systems, and no internal state of TimescaleDB is modified, so there should be little conflict with pre-existing systems.

  • [x] Please confirm that the naming of the functions is appropriate for their use case, as I was not sure about existing naming conventions.
  • [x] Please also detail whether or not functions need to be documented in the docs/ folder, or if in-source documentation as provided is sufficient.

Xasin avatar Jun 03 '23 15:06 Xasin

Hi @Xasin, very good addition! So happy to see it!

Would you mind adding some example of this in action?

Probably improving the README with some instructions would be good for test, review and future users.

jonatas avatar Jun 07 '23 15:06 jonatas

On it! I'll copy the documentation and add an example section for this function in the docs/ folder, and add a link to it from the main README.

Xasin avatar Jun 07 '23 20:06 Xasin

Preliminary docs entry added, plus a link from the README to the docs file. I'd appreciate some feedback on the description I'm using for the function: it's hard for me to judge whether its purpose and functionality come across properly, partly because I know the internals of the function quite well.

Xasin avatar Jun 07 '23 22:06 Xasin

@jonatas I think the problem there is that you are using a different schema than the "public" schema. The schema name can be given via the hypertable_schema named parameter.

I am assuming "public" is the default schema, but perhaps there is a better way, e.g. using the currently "active" schema, if such a thing exists?
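As a sketch (using the parameter order from the function definition later in this thread; the `load` and `host` columns and the `metrics` schema are hypothetical), a call against a non-public schema would look like:

```sql
-- Hypothetical example: the hypertable "kernel" lives in a "metrics" schema,
-- so we pass the schema via the named hypertable_schema parameter.
SELECT *
FROM auto_downsample(
        'kernel',
        INTERVAL '10m',
        $$[{"with_columns": ["load"], "aggregate": "avg(load) AS value"}]$$,
        'host',
        $$WHERE time > NOW() - INTERVAL '7d'$$,
        hypertable_schema => 'metrics')
AS (time TIMESTAMP, host TEXT, value NUMERIC);
```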

What caught me off guard is that it raises the wrong error. I thought I had a guard in place that specifically detects when it can't find a fitting table and gives an appropriate error message, but apparently it didn't work. I tweaked it so it now correctly reports when it can't find the hypertable; the error message looks like this:

ERROR:  No fitting hypertable or aggregate found for given columns and table "public"."kernel"!
HINT:  Make sure the 'with_columns' field of the aggregate options is correct, and that `hypertable` and `hypertable_schema` are correct!

It seems to work fine on a Hypertable with no continuous aggregates for me.

Xasin avatar Jun 19 '23 19:06 Xasin

Thanks for the fix @Xasin.

Just trying it again:

CREATE TABLE ticks (
    time TIMESTAMP NOT NULL,
    symbol VARCHAR,
    price DECIMAL,
    volume INT
);
SELECT create_hypertable('ticks', 'time');

And then:

SELECT *
FROM auto_downsample(
        'ticks',
        INTERVAL '10m',
        $aggregate_options$
            [
                {"with_columns": ["price"], "aggregate": "avg(price) AS value"}
            ]
        $aggregate_options$,
        'symbol',
        $where_clause$
            WHERE time BETWEEN NOW() - INTERVAL '30d' AND NOW() - INTERVAL '10d'
        $where_clause$)
AS (time TIMESTAMP, symbol text, value DOUBLE PRECISION);

Now, after reloading I have the following error:

ERROR:  function auto_downsample(unknown, interval, unknown, unknown, unknown) does not exist
LINE 2: FROM auto_downsample(
             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

jonatas avatar Jun 19 '23 19:06 jonatas

@jonatas very strange, I am getting an error as well, but it's quite different.

When I run your commands 1:1 in my psql prompt (on a PostgreSQL 13.11 server), the error message is the following:

NOTICE:  Using parameter set {"aggregate": "avg(price) AS value", "table_name": "ticks", "table_schema": "public", "with_columns": ["price"], "table_interval": "00:00:00", "interval_matched": false}
ERROR:  structure of query does not match function result type
DETAIL:  Returned type character varying does not match expected type text in column 2.
CONTEXT:  PL/pgSQL function auto_downsample(text,interval,jsonb,text,text,text,text,boolean) line 42 at RETURN QUERY

But after replacing the types in the AS ( ) clause with the proper ones (i.e. CHARACTER VARYING and NUMERIC) it actually works fine for me. Maybe different PostgreSQL versions pull the AS ( ) clause into the function resolution itself and throw a less understandable error on a return type mismatch?

Xasin avatar Jun 19 '23 19:06 Xasin

I'm on PostgreSQL 14.7. Maybe we could include a complete runnable example in the examples, so that people can replicate it and then adapt it to their needs. Can you share the changes you made, just to make sure I'm following all the replacements in the AS ( ) clause that you mentioned?

jonatas avatar Jun 20 '23 13:06 jonatas

@jonatas the edited AS clause has to declare exactly the same types as the query returns, so for the case you gave me it looks like: AS (time TIMESTAMP, symbol CHARACTER VARYING, value NUMERIC);. varchar and text are different types, and the aggregation functions seem to return numeric.
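For completeness, the call from earlier with the corrected AS clause (same query, only the column types changed) would be:

```sql
SELECT *
FROM auto_downsample(
        'ticks',
        INTERVAL '10m',
        $aggregate_options$
            [{"with_columns": ["price"], "aggregate": "avg(price) AS value"}]
        $aggregate_options$,
        'symbol',
        $where_clause$
            WHERE time BETWEEN NOW() - INTERVAL '30d' AND NOW() - INTERVAL '10d'
        $where_clause$)
-- The declared types now match what the query actually returns:
AS (time TIMESTAMP, symbol CHARACTER VARYING, value NUMERIC);
```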

I also confirmed that I have no other typedef of the function that might be hiding the issue, so I'm sure I only have the auto_downsample function version of the provided SQL file.

I'll update a fully working example in a moment.

Xasin avatar Jun 20 '23 14:06 Xasin

I'm really happy to help! It might come in quite handy in some of my own work soon, which is very exciting. In any case, I am more than happy to give back to such a helpful tool.

I don't think there's anything else from me left to change, so unless there's something else that's needed I think this can be merged in.

Xasin avatar Jun 20 '23 19:06 Xasin

Amazing! Thanks again! Adding Chris to have a second 👀 and if @d-sooter has any feedback let us know before we merge it.

jonatas avatar Jun 21 '23 22:06 jonatas

So I played around a bit with the function and I love it, and will be using it :) One issue I had is that I join my hypertable/continuous aggregate against a metadata table, and since I don't know in advance which table will be selected, I can't really write my join.

For my situation I just added a fixed alias to the table so that it can be referenced in a join if needed, but I'm not sure whether it would make more sense to allow an alias to be passed in for the table, or to keep a fixed alias.

nimbit-software avatar Jun 25 '23 14:06 nimbit-software

@nimbit-software hm... That is an interesting problem. Would it help if I aliased the selected table to a common name inside the internal query, so that it's consistently referenceable for JOINs?

For now, can you try something for me? The way the WHERE clause is injected into the SQL, it's appended directly after the FROM clause, so if you start the filter clause with an AS table_name, the AS should attach directly to the FROM table and give it an alias.

EDIT: Maybe that's what you already did. I think internally assigning a fixed table alias is the cleaner solution.
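To illustrate the suggested workaround (an untested sketch; it relies on the filter clause being appended directly after the FROM clause in the generated query):

```sql
SELECT *
FROM auto_downsample(
        'ticks',
        INTERVAL '10m',
        $$[{"with_columns": ["price"], "aggregate": "avg(price) AS value"}]$$,
        'symbol',
        -- Starting the filter string with "AS ts" turns it into a table alias,
        -- since it lands right after "FROM <schema>.<table>" in the generated
        -- query. A JOIN against a metadata table could go between the alias
        -- and the WHERE.
        $$AS ts WHERE ts.time > NOW() - INTERVAL '7d'$$)
AS (time TIMESTAMP, symbol CHARACTER VARYING, value NUMERIC);
```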

Xasin avatar Jun 26 '23 08:06 Xasin

So basically this is what I did:

I set a static alias "timeseries", and then I can perform my join. I can imagine cases where people would want to pass in their own alias, but then we have yet another input parameter and it starts getting a bit crowded.

query_construct := format($qry$
        SELECT time_bucket(%L, %I) AS time, %s, %s
        FROM %I.%I AS timeseries
        %s
        GROUP BY 1, %s
        ORDER BY 1
    $qry$, data_interval, time_column, groupby_clause, aggregator_column,
    selected_parameters->>'table_schema', selected_parameters->>'table_name',
    filter_query, groupby_clause);

We could check the table name and, if there is a space in it, split it up and set the alias:

CREATE OR REPLACE FUNCTION auto_downsample(hypertable TEXT, data_interval INTERVAL,
	aggregate_choices JSONB, 
	groupby_clause TEXT, filter_query TEXT,
	hypertable_schema TEXT DEFAULT 'public',
	time_column TEXT DEFAULT 'time',
	debug_query BOOLEAN DEFAULT FALSE)
RETURNS SETOF RECORD
LANGUAGE plpgsql
STABLE
AS $BODY$
DECLARE
	selected_parameters jsonb;
	aggregator_column TEXT;
	query_construct TEXT;
	hypertable_internal text;
  	hypertable_alias text;
  	parts text[];
BEGIN
	  IF hypertable ~* '\s+AS\s+' THEN
    	parts := regexp_split_to_array(hypertable, '\s+AS\s+', 'i');
	  ELSE
		parts := regexp_split_to_array(hypertable, '\s+', 'i');
	  END IF;

	  -- Extract the table name and alias
	  hypertable_internal := parts[1];
	  hypertable_alias := parts[array_length(parts, 1)];



	SELECT aggregate_choice(hypertable_internal, data_interval, aggregate_choices, hypertable_schema) INTO selected_parameters;

	IF selected_parameters IS NULL THEN
		RAISE EXCEPTION $$No fitting hypertable or aggregate found for given columns and table "%"."%"!$$,
			hypertable_schema, hypertable_internal
			USING HINT = $$Make sure the 'with_columns' field of the aggregate options is correct, and that `hypertable` and `hypertable_schema` are correct!$$;
	END IF;

	RAISE NOTICE 'Using parameter set %', selected_parameters;

	aggregator_column := selected_parameters->>'aggregate';

	IF aggregator_column IS NULL THEN
		RAISE EXCEPTION 'No aggregator given!' USING HINT = 'Supply an "aggregate" field in the JSON aggregate object';
	END IF;

	query_construct := format($qry$
			SELECT time_bucket(%L, %I) AS time, %s, %s
			FROM %I.%I AS %I
			%s
			GROUP BY 1, %s
			ORDER BY 1
		$qry$, data_interval, time_column, groupby_clause, aggregator_column,
		selected_parameters->>'table_schema', selected_parameters->>'table_name', hypertable_alias,
		filter_query,
		groupby_clause);

	IF debug_query THEN
		RAISE NOTICE 'Generated query output: %', query_construct;
	END IF;

	RETURN QUERY EXECUTE query_construct;
END;
$BODY$;
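With that change, a call could (hypothetically; the `symbols` metadata table is made up for illustration) pass an alias as part of the table name and join against it inside the filter clause:

```sql
SELECT *
FROM auto_downsample(
        'ticks AS ts',   -- table name plus alias, split apart by the regexp above
        INTERVAL '10m',
        $$[{"with_columns": ["price"], "aggregate": "avg(price) AS value"}]$$,
        'ts.symbol',
        -- The filter clause lands after "FROM public.ticks AS ts", so the
        -- alias is available for the JOIN and the WHERE.
        $$JOIN symbols s ON s.symbol = ts.symbol
          WHERE ts.time > NOW() - INTERVAL '7d'$$)
AS (time TIMESTAMP, symbol CHARACTER VARYING, value NUMERIC);
```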

d-sooter avatar Jun 26 '23 10:06 d-sooter

@d-sooter that doesn't sound too bad. "timeseries" should be OK, but it almost sounds like a potential internal keyword?

It doesn't actually matter much how many parameters the function has. PL/pgSQL functions support named parameters, where individual parameters can be set explicitly, e.g. some_function(parameter_a => 10, parameter_c => 32). This keeps calls fairly readable even with higher parameter counts, and if there is a "sane default" that most people will use (like how the default schema is public and the default time column name is time), it's not that bad to add more arguments.
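Applied to auto_downsample (a sketch using the signature from the snippet above; positional arguments must come before named ones):

```sql
SELECT *
FROM auto_downsample(
        'ticks',
        INTERVAL '10m',
        $$[{"with_columns": ["price"], "aggregate": "avg(price) AS value"}]$$,
        'symbol',
        $$WHERE time > NOW() - INTERVAL '7d'$$,
        -- Only override the defaults we care about, by name:
        time_column => 'time',
        debug_query => TRUE)
AS (time TIMESTAMP, symbol CHARACTER VARYING, value NUMERIC);
```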

I don't feel comfortable with the RegExp solution; it can get a bit messy and is harder to understand than an explicit parameter like that.

Xasin avatar Jun 26 '23 14:06 Xasin

I had one last idea. I already implemented it in my system and it works well: gapfill.

I added a parameter to the function

bucket_function TEXT DEFAULT 'time_bucket',

and then just updated the query

          query_construct := format($qry$
                        SELECT %I(%L, %I) AS time, %s, %s
                        FROM %I.%I AS %I
                        %s
                        GROUP BY 1, %s
                        ORDER BY 1
                    $qry$, bucket_function, data_interval, time_column, groupby_clause, aggregator_column,
                    selected_parameters->>'table_schema', selected_parameters->>'table_name',
                    hypertable_alias,
                    filter_query,
                    groupby_clause);

Since the WHERE clause is passed in, this makes it easy to use the interpolation or locf functions as well.
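A sketch of a gapfilled call under the patched signature (the bucket_function parameter and the locf usage come from the snippets above; time_bucket_gapfill needs a bounded time range, which the WHERE clause supplies):

```sql
SELECT *
FROM auto_downsample(
        'ticks',
        INTERVAL '10m',
        -- locf() carries the last value forward across empty buckets.
        $$[{"with_columns": ["price"], "aggregate": "locf(avg(price)) AS value"}]$$,
        'symbol',
        $$WHERE time BETWEEN NOW() - INTERVAL '30d' AND NOW() - INTERVAL '10d'$$,
        -- Swap the bucketing function for TimescaleDB's gapfilling variant:
        bucket_function => 'time_bucket_gapfill')
AS (time TIMESTAMP, symbol CHARACTER VARYING, value NUMERIC);
```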

I tested it and it works like a charm

nimbit-software avatar Jul 12 '23 12:07 nimbit-software