timescaledb icon indicating copy to clipboard operation
timescaledb copied to clipboard

Change materialization hypertable indexes

Open fabriziomello opened this issue 2 years ago • 5 comments

Nowadays for the Continuous Aggregate materialization hypertable we add additional indexes. Those additional indexes are one for each group-by column plus time bucket expression and it can lead to unsusable indexes.

This PR change the current logic to add just one additional composite index including all group-by columns plus the time bucket expression.

fabriziomello avatar May 18 '22 19:05 fabriziomello

Codecov Report

Merging #4354 (94ca9c6) into main (047d6b1) will increase coverage by 0.04%. The diff coverage is 94.05%.

:exclamation: Current head 94ca9c6 differs from pull request most recent head c10f033. Consider uploading reports for the commit c10f033 to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4354      +/-   ##
==========================================
+ Coverage   90.79%   90.84%   +0.04%     
==========================================
  Files         216      217       +1     
  Lines       39953    40264     +311     
==========================================
+ Hits        36275    36576     +301     
- Misses       3678     3688      +10     
Impacted Files Coverage Δ
src/compat/compat.h 94.73% <ø> (ø)
src/loader/bgw_launcher.c 91.82% <ø> (ø)
src/planner/add_hashagg.c 54.28% <ø> (+1.42%) :arrow_up:
src/planner/planner.h 100.00% <ø> (ø)
src/ts_catalog/catalog.h 100.00% <ø> (ø)
src/utils.h 82.60% <ø> (ø)
tsl/src/init.c 95.45% <ø> (ø)
src/telemetry/telemetry.c 81.25% <16.66%> (-1.46%) :arrow_down:
src/cross_module_fn.c 69.28% <85.71%> (+0.31%) :arrow_up:
tsl/src/reorder.c 85.29% <90.32%> (+0.07%) :arrow_up:
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 0a68209...c10f033. Read the comment docs.

codecov[bot] avatar May 18 '22 19:05 codecov[bot]

I don't quite understand the reasoning behind this change. Are we saying that those indices are useless now due to the final form? In that case we could remove them altogether.

Do we have a use-case in mind where the composite index is useful?

The problem is the current implementation create one index for each aggregated column plus time bucket, for example the following continuous aggregate:

CREATE MATERIALIZED VIEW continuous_aggregates
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 day', column1),
  column2,
  column3,
  column4,
  SUM(column5)
FROM
  raw_hypertable
GROUP BY
  time_bucket('1 day', column1),
  column2,
  column3,
  column4;

It will create 4 indexes in the materialization hypertable:

  • indx1 (time_bucket DESC)
  • indx2 (column2, time_bucket DESC)
  • indx3 (column3, time_bucket DESC)
  • indx4 (column4, time_bucket DESC)

And depending of the queries over the continuous aggregate some indexes will become useless (not used), and turning into a single composite index we're trying to avoid this problem as much we can.

With this PR we'll end up with just two indexes:

  • indx1 (time_bucket DESC)
  • indx1 (column2, column3, column4, time_bucket DESC)

It was discussed some time ago during one of our meetings /cc @svenklemm @gayyappan

fabriziomello avatar May 26 '22 19:05 fabriziomello

I don't quite understand the motivation behind this change.

  1. What is this supposed to provide, more space savings by not using too many indices? Performance shouldn't increase with this change.
  2. Isn't this codepath supposed to be unreachable because users are not supposed to use the old format? *skip_adding = finalized; would lead to never creating those indices.

mfundul avatar Jun 02 '22 13:06 mfundul

Let's not merge this before we are sure there are no performance regressions, since we're presenting performance numbers for 2.7.0 and we wouldn't want to risk invalidating them this quickly.

Well... if gratuitously adding indexes to caggs is the reason for the improved performance, I don't think this necessarily is a good thing. Indexes take space and also affect performance when updating or inserting. For the same reason as why indexes are not added to tables in general, I think that we should avoid creating them unless there is a good reason. I do not think that the fact that they can affect our published benchmarks is a reason to hold back on merging.

It would, however, be good if we could redo the benchmarks and compare with and without this commit, just to understand where we stand.

mkindahl avatar Jun 07 '22 06:06 mkindahl

Let's not merge this before we are sure there are no performance regressions, since we're presenting performance numbers for 2.7.0 and we wouldn't want to risk invalidating them this quickly.

I wouldn't be horribly worried about this, the index we're replacing these indexes with is as likely to be better than the others as not, it could cause performance differences in some cases, but I don't think it's horribly likely to be a problem. Especially when we combine it with #4430 , where users can create their own indexes on continuous aggregates, I think any performance issues would be pretty transitory. Aside from that this will not impact the benchmark numbers as far as I can tell because the same indexes will be created in the benchmark case as there is only one group by column in addition to the time bucket, this only impacts continuous aggregates that have multiple group by columns.

davidkohn88 avatar Jun 11 '22 19:06 davidkohn88

Closing because the benchmarks I've made in the past didn't prove the performance improvements. We can revisit it in the future.

fabriziomello avatar Oct 06 '23 20:10 fabriziomello