
Add support for correlated constraints

Open · nikkhils opened this pull request 4 months ago · 3 comments

Allow users to designate a column as a correlated constraint using the add_dimension() API. Such correlated constraints are stored in the dimension-related timescaledb catalog tables. The "dimension" catalog table has been amended with a "type" column, which now explicitly stores the dimension type: open, closed, or correlated, as appropriate.

We create dimension_slice and chunk_constraint entries for chunks that have correlated constraints on them. Initially, the dimension_slice entry for a given correlated constraint has -inf/+inf as its start/end range, and the chunk_constraint entry refers back to this slice entry.
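The catalog entries described above can be modeled with a small sketch. The class and field names below are illustrative stand-ins for the dimension_slice and chunk_constraint catalog rows, not the actual TimescaleDB schema:

```python
import math
from dataclasses import dataclass

@dataclass
class DimensionSlice:
    # Illustrative stand-in for a row in the dimension_slice catalog table.
    slice_id: int
    dimension_id: int
    range_start: float  # -inf until the range is refreshed
    range_end: float    # +inf until the range is refreshed

@dataclass
class ChunkConstraint:
    # Links a chunk to the slice that will hold its min/max range.
    chunk_id: int
    slice_id: int

# A freshly created chunk gets an unbounded slice for its correlated column.
initial_slice = DimensionSlice(slice_id=1, dimension_id=7,
                               range_start=-math.inf, range_end=math.inf)
constraint = ChunkConstraint(chunk_id=42, slice_id=initial_slice.slice_id)
```

An unbounded (-inf/+inf) range means the chunk can never be excluded on this column until the range is refreshed.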

This start/end range is refreshed later; for now, one of the entry points is during compression.

We can thus store the per-chunk min/max values for such correlated constraints in these catalog tables. Note that correlated constraints do not participate in partitioning the data. A correlated constraint is used for chunk pruning when the WHERE clause of a SQL query specifies a range on that column.
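The pruning idea can be sketched as follows. This is a minimal illustration, not the actual planner code: each chunk carries a stored (min, max) pair for the correlated column, and a chunk survives only if that range can intersect the query's WHERE range. An unrefreshed or invalidated range is (-inf, +inf), which never prunes:

```python
import math

def chunk_can_match(chunk_range, query_lo, query_hi):
    """Return True if the chunk might contain rows in [query_lo, query_hi].

    chunk_range is the stored (min, max) pair for the correlated column;
    an unrefreshed or invalidated range is (-inf, +inf), so it never prunes.
    """
    lo, hi = chunk_range
    return hi >= query_lo and lo <= query_hi

# Hypothetical per-chunk min/max ranges for a correlated column.
chunks = {
    "chunk_1": (0, 99),
    "chunk_2": (100, 199),
    "chunk_3": (-math.inf, math.inf),  # range not yet computed
}

# WHERE col BETWEEN 120 AND 150: chunk_1 is pruned, the unbounded chunk is kept.
survivors = [c for c, r in chunks.items() if chunk_can_match(r, 120, 150)]
print(survivors)  # → ['chunk_2', 'chunk_3']
```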

Disable-check: force-changelog-file

Disable-check: commit-count

nikkhils · Mar 04 '24 10:03

Codecov Report

Attention: Patch coverage is 82.59109%, with 43 lines in your changes missing coverage. Please review.

Project coverage is 81.00%. Comparing base (59f50f2) to head (9c879ec). Report is 161 commits behind head on main.

:exclamation: Current head 9c879ec differs from pull request most recent head 03553c9. Consider uploading reports for the commit 03553c9 to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| src/dimension.c | 74.19% | 15 Missing and 9 partials :warning: |
| src/dimension_slice.c | 82.05% | 6 Missing and 8 partials :warning: |
| src/chunk_constraint.c | 94.59% | 2 Missing :warning: |
| src/process_utility.c | 71.42% | 0 Missing and 2 partials :warning: |
| src/chunk_adaptive.c | 88.88% | 0 Missing and 1 partial :warning: |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6727      +/-   ##
==========================================
+ Coverage   80.06%   81.00%   +0.93%     
==========================================
  Files         190      198       +8     
  Lines       37181    37499     +318     
  Branches     9450     9788     +338     
==========================================
+ Hits        29770    30376     +606     
- Misses       2997     3208     +211     
+ Partials     4414     3915     -499     


codecov[bot] · Mar 08 '24 07:03

Have you considered not storing these as dimensions? This seems to be quite an invasive change, and it means that all existing dimension code needs to be checked for whether it handles correlated constraints correctly.

svenklemm · May 06 '24 07:05

I have several issues with this PR. I don't think this can work by relying on the adaptive chunking code, which has an unproven track record and was written before compression even existed.

@antekresic

We are not relying on the adaptive chunking code at all. We have only fixed issues in the function it used for calculating a column's min/max via indexes, nothing else.

Relying on indexes to update the constraint ranges will not work, and doing heap scans on uncompressed chunk regions will not work either. Updating ranges needs to include the compressed data. Creating those ranges before compression also makes very little sense.

The ranges are calculated and added when we call compress_chunk, AFTER a strong lock has been taken on the chunk being compressed. How can the ranges change in that case?

Updating ranges by hooking into the function that sets the chunk as partial is going to hurt DML on compressed chunks which is already a big pain point.

We do NOT update the range; we just mark it INVALID once, as part of marking the chunk partial. The range is recalculated only when the partial chunk gets compressed again.
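The lifecycle described above can be sketched in a few lines. This is a toy model under the assumptions stated in the thread (names are illustrative, not the actual TimescaleDB C symbols): the range starts unbounded, is computed at compression time under a strong lock, and is merely invalidated when DML marks the chunk partial:

```python
import math

class CorrelatedRange:
    """Toy model of the per-chunk range lifecycle for a correlated column."""

    def __init__(self):
        # Starts unbounded, i.e. never usable for pruning.
        self.range = (-math.inf, math.inf)

    def on_compress(self, column_values):
        # compress_chunk computes the real min/max under a strong lock.
        self.range = (min(column_values), max(column_values))

    def on_dml_marks_partial(self):
        # DML on a compressed chunk marks it partial; the range is only
        # invalidated, not recomputed, until the next compression.
        self.range = (-math.inf, math.inf)

r = CorrelatedRange()
r.on_compress([5, 17, 3, 8])
assert r.range == (3, 17)
r.on_dml_marks_partial()
assert r.range == (-math.inf, math.inf)
```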

Needs a lot more testing, especially with updates on the column that has the correlated constraint set, I don't see such a test at all.

As mentioned above, the moment you do DML on a compressed chunk, it is marked partial; that is when we invalidate the range for correlated constraints. This is tested in the PR.

Finally, a tsbench run showing the promised performance improvement is a must, since this is an optimization.

I did manual runs with data shared by AlexK some time ago, and mentioned on the Slack thread that we see the expected chunk-exclusion benefits for queries with WHERE clauses on these correlated columns.

nikkhils · May 09 '24 09:05