Aggregate tables
Sometimes it's better to have pre-aggregated measures built from the original fact table to make common queries faster, especially when exploring cubes over a very large fact table.
How would the fact that a pre-aggregated star is already available be presented to the browser, in the form of metadata or options?
Is it part of the model? Should we start thinking about in-database metadata where this information can be stored independently of the model? Something like a mapping of cube → (pre-aggregated dimensions, fact_name)?
I don't know yet; it would be good if we could throw some ideas around here.
As I understand this, a pre-aggregated fact table can be viewed as a totally independent definition of a cube model. Its relationship with the original non-aggregated cube is a single constraint: the aggregate model is a strict subset of the original model. By strict subset I mean that the fact table has a subset of the measures available in the original cube, and the dimensions may have fewer hierarchy levels, or be absent entirely.
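A minimal Python sketch of that subset check, using hypothetical stand-in structures (CubeModel and DimensionModel are not the real Cubes classes, just placeholders for illustration):

    from dataclasses import dataclass

    @dataclass
    class DimensionModel:
        name: str
        levels: list          # ordered hierarchy levels, e.g. ["year", "month", "day"]

    @dataclass
    class CubeModel:
        name: str
        measures: set         # measure names
        dimensions: dict      # dimension name -> DimensionModel

    def is_valid_aggregate(aggregate: CubeModel, base: CubeModel) -> bool:
        """Check the 'strict subset' relationship: the aggregate keeps a subset
        of the base measures, and each of its dimensions keeps only a leading
        prefix of the base dimension's hierarchy levels (or is absent)."""
        if not aggregate.measures <= base.measures:
            return False
        for name, agg_dim in aggregate.dimensions.items():
            base_dim = base.dimensions.get(name)
            if base_dim is None:
                return False  # aggregate references a dimension the base cube lacks
            if agg_dim.levels != base_dim.levels[: len(agg_dim.levels)]:
                return False  # levels must be a prefix of the base hierarchy
        return True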
Very simple example: I have a facts table and a date dimension with [year, month, day] levels. My aggregate fact table is aggregated at the month level, and as such it should be used whenever the date dimension is queried only at the [year, month] levels.
An initial implementation idea would be to require that the aggregated model is fully defined alongside the original cube model, perhaps with an additional aggregates_cube attribute to mark the relationship, such as:
"cubes": [{
"name": "facts",
"dimensions": ["date"]
}, {
"name": "aggregated_facts",
"aggregates_cube": "facts",
"dimensions": [{"name": "date", "hierarchies": ["ym"]}]
}]
Loading the model would require validation to make sure that the aggregates_cube cubes are valid subsets of their target model. Then, the browser would have to be instantiated with the original cube together with its aggregates. Every time it performs a query, it would invoke logic that takes the selected measures, cuts, and drilldowns as parameters and determines which facts table to query. A query hitting only the year level would never have to touch the original facts table.
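A rough sketch of that routing logic (again hypothetical, reusing the stand-in structures above rather than the actual Cubes browser API): gather the measures and the dimension levels a query touches, then prefer the coarsest aggregate that still covers them, falling back to the original facts table.

    def covers(cube, measures, levels_by_dim):
        """True if the cube's model can answer a query that needs these measures
        and these dimension levels, e.g. {"date": ["year", "month"]}."""
        if not set(measures) <= cube.measures:
            return False
        for dim, levels in levels_by_dim.items():
            cube_dim = cube.dimensions.get(dim)
            if cube_dim is None or not set(levels) <= set(cube_dim.levels):
                return False
        return True

    def choose_fact_table(base, aggregates, measures, levels_by_dim):
        """Pick the aggregate with the fewest total levels (presumably the
        smallest table) that covers the query, else use the base fact table."""
        candidates = [a for a in aggregates if covers(a, measures, levels_by_dim)]
        if candidates:
            return min(candidates,
                       key=lambda a: sum(len(d.levels) for d in a.dimensions.values()))
        return base

    # Example: ["amount"] drilled down to {"date": ["year", "month"]} would be
    # routed to aggregated_facts; adding "day" would fall back to facts.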
This is a pure optimization method. As such, I don't see the value of this information being available to the cube API's end user, unless it is for auditing/debugging.
+1. I just tried cubes on a 5-million-fact, 10-dimension dataset and performance was poor. Pre-aggregated measures would dramatically reduce the number of rows scanned for most queries.
Just an update: this is a very important feature and requires a model and backend redesign around the concept of cube representations. There will be backward-incompatible model changes, though.
To provide a bit of insight into the idea, based on the example above by @dtheodor: aggregated_facts will be a representation of the cube facts. There are still certain implementation details to be resolved, but I'll post more information as soon as I have proof that the idea works as intended (also across backends).
+1
+1
Mondrian does something like this too: http://mondrian.pentaho.com/documentation/aggregate_tables.php#What_are_aggregates