timescaledb icon indicating copy to clipboard operation
timescaledb copied to clipboard

Change in format for numeric type

Open mkindahl opened this issue 2 years ago • 1 comments

Background

When implementing aggregation functions, they proceed by generating internal state and then apply a finalization function to "complete" the aggregation. To support parallel aggregation, there is also support for partial aggregate, which is also an internal intermediate state that can be passed between processes when computing an aggregate. Continuous aggregates use this format and store it into the materialization table directly as a byte array.

The internal format for the aggregation state and the partial aggregates are intended to be passed between processes. However, continuous aggregates use this format to serialize the values and write it as a byte array in the table and then use the COMBINE function to combine different partials into a final form using _timescaledb_internal.finalize_agg.

The numeric format changed between PG13 and PG14 to include infinities (https://github.com/postgres/postgres/commit/a57d312a7706321d850faa048a562a0c0c01b835), which means that the serialized format that is used by continuous aggregates will not work after upgrading to PG14.

A check was added in #4369 so that when upgrading the TimescaleDB version will generate a warning about the format change but there is no way to capture it when upgrading the PG version.

Problems

The problems described here are only present if the following two conditions are met:

  • The partial form of the continuous aggregates is used. If the finalized form implemented in #4269 is used, this condition is not met.
  • The numeric type is used in the continuous aggregate, that is, it is used in such a way that numeric has to be stored in the materialized hypertable.

If these two conditions are met for a continuous aggregate, we have the following problems that we need to consider:

  1. If a user upgrades from PG13 to PG14, continuous aggregates will immediately stop working and generate an error as shown below. Since there is no way to "downgrade" PG versions, it is difficult to undo this operation (it is possible if taking a backup before and restoring it to the old version).
  2. Apart from the error being strange, it is a simple length error for the serialized internal value, so it is not entirely clear if there are combinations where lengths will match before and after the upgrade, but still generate incorrect values.
  3. Migrating to the finals form eliminate the problem, but if large amounts of data is stored in the continuous aggregate it might be prohibitive to transform the format to the finals form, which means that mitigation 1 below might not be feasible to use.
  4. Transforming the continuous aggregate to the PG14 format solves the problem, but for large amounts of data, it might be prohibitive to update from pre-PG14 format to PG14 format, which means that mitigation 2 below might not be feasible.
  5. Patching the format "on the fly" according to mitigation affects execution time. Some experiments from implementation of #4427 shows a degradation of 30% for this particular operation (but that does not translate to 30% slowdown of queries that do this since there are more factors involved than just patching the format).

The following query can be used to detect if there are any continuous aggregates that have numeric columns:

SELECT attrelid::regclass
  FROM _timescaledb_catalog.continuous_agg, pg_attribute
 WHERE attrelid = format('%I.%I', user_view_schema, user_view_name)::regclass
   AND atttypid = 'numeric'::regtype
GROUP BY attrelid::regclass
HAVING COUNT(attname) > 0;

Mitigations

We need to make sure that it is possible to handle the following mitigations.

  1. To allow users to solve the issue before updating we should add support for translating the partials form to the finals form. This will allow users to change the format of the continuous aggregate before upgrading the PostgreSQL version.
  2. To allow users to fix the problem after upgrading, we need to add support for fixing the format of partials in version PG14.
  3. To allow users to use pre-PG13 partials at later versions, we need to have different finalize_agg functions for pre-PG14 and PG14.
  4. To allow users to read the old format, we can support fixing the deserialized format "on the fly" before running the combine function.

TimescaleDB version affected

All

PostgreSQL version used

14.x

Relevant log output and stack trace

=# SELECT * FROM cagg ;
ERROR:  08P01: insufficient data left in message
LOCATION:  pq_copymsgbytes, pqformat.c:533

How can we reproduce the bug?

  1. Install any version that supports PG13 and PG14 on a PG13 build.
  2. Create a continuous aggregate with numeric type.
  3. Upgrade to PG14
  4. Run the select above.

Related tasks

  • [x] #4369
  • [x] #4427
  • [x] #4376
  • [x] #4424
  • [ ] #4428

mkindahl avatar May 31 '22 10:05 mkindahl

https://github.com/timescale/timescaledb/issues/4427#issuecomment-1156427272

mfundul avatar Jun 16 '22 11:06 mfundul