Aggregate tables
Sometimes it's better to have pre-aggregated measures built from the original fact table to make common queries faster, especially when exploring cubes over a very large fact table.
How would the fact that a pre-aggregated star is already available be presented to the browser, in the form of metadata or options?
Is it part of the model? Should we start thinking about in-database metadata where this information can be stored independently of the model? Something like a mapping of cube → (pre-aggregated dimensions, fact_name)?
I don't know yet; it would be good if we could throw some ideas around here.
As I understand this, a pre-aggregated fact table can be viewed as a totally independent definition of a cube model. Its relationship with the original non-aggregated cube is a single constraint: the aggregate model is a strict subset of the original model. By strict subset I mean that the fact table has a subset of the measures available in the original cube, and the dimensions may have fewer hierarchy levels, or be absent entirely.
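A minimal Python sketch of that subset check, using hypothetical stand-in structures (CubeModel and DimensionModel are not the real Cubes classes, just placeholders for illustration):

    from dataclasses import dataclass

    @dataclass
    class DimensionModel:
        name: str
        levels: list          # ordered hierarchy levels, e.g. ["year", "month", "day"]

    @dataclass
    class CubeModel:
        name: str
        measures: set         # measure names
        dimensions: dict      # dimension name -> DimensionModel

    def is_valid_aggregate(aggregate: CubeModel, base: CubeModel) -> bool:
        """Check the 'strict subset' relationship: the aggregate keeps a subset
        of the base measures, and each of its dimensions keeps only a leading
        prefix of the base dimension's hierarchy levels (or is absent)."""
        if not aggregate.measures <= base.measures:
            return False
        for name, agg_dim in aggregate.dimensions.items():
            base_dim = base.dimensions.get(name)
            if base_dim is None:
                return False  # aggregate references a dimension the base cube lacks
            if agg_dim.levels != base_dim.levels[: len(agg_dim.levels)]:
                return False  # levels must be a prefix of the base hierarchy
        return True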
Very simple example: I have a facts table and a date dimension with [year, month, day] levels. My aggregate fact table is aggregated at the month level, and as such it should be used whenever the date dimension is queried only at the [year, month] levels.
An initial implementation idea would be to require that the aggregated model is fully defined alongside the original cube model, perhaps with an additional aggregates_cube attribute to mark the relationship, such as:
"cubes": [{
"name": "facts",
"dimensions": ["date"]
}, {
"name": "aggregated_facts",
"aggregates_cube": "facts",
"dimensions": [{"name": "date", "hierarchies": ["ym"]}]
}]
Loading the model would require validation to make sure that the aggregates_cube cubes are valid subsets of their target model. Then, the browser would have to be instantiated with the original cube together with its aggregates. Every time it performs a query, it would invoke logic that takes the selected measures, cuts, and drilldowns as parameters and determines which facts table to query. A query hitting only the year level would never have to touch the original facts table.
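A rough sketch of that routing logic (again hypothetical, reusing the stand-in structures above rather than the actual Cubes browser API): gather the measures and the dimension levels a query touches, then prefer the coarsest aggregate that still covers them, falling back to the original facts table.

    def covers(cube, measures, levels_by_dim):
        """True if the cube's model can answer a query that needs these measures
        and these dimension levels, e.g. {"date": ["year", "month"]}."""
        if not set(measures) <= cube.measures:
            return False
        for dim, levels in levels_by_dim.items():
            cube_dim = cube.dimensions.get(dim)
            if cube_dim is None or not set(levels) <= set(cube_dim.levels):
                return False
        return True

    def choose_fact_table(base, aggregates, measures, levels_by_dim):
        """Pick the aggregate with the fewest total levels (presumably the
        smallest table) that covers the query, else use the base fact table."""
        candidates = [a for a in aggregates if covers(a, measures, levels_by_dim)]
        if candidates:
            return min(candidates,
                       key=lambda a: sum(len(d.levels) for d in a.dimensions.values()))
        return base

    # Example: ["amount"] drilled down to {"date": ["year", "month"]} would be
    # routed to aggregated_facts; adding "day" would fall back to facts.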
This is a pure optimization method. As such, I don't see the value of this information being available to the cube API's end user, unless it is for auditing/debugging.
+1. I just tried cubes on a 5-million-fact, 10-dimension dataset and performance was poor. Pre-aggregated measures would dramatically reduce the number of rows scanned for most queries.
Just an update: this is a very important feature and requires a model and backend redesign around the concept of cube representations. There will be backward-incompatible model changes, though.
To provide a bit of insight into the idea, based on the example above by @dtheodor: aggregated_facts will be a representation of the cube facts. There are still certain implementation details to be resolved, but I'll post more information as soon as I have proof that the idea works as intended (also across backends).
+1
+1
Mondrian does something like this too: http://mondrian.pentaho.com/documentation/aggregate_tables.php#What_are_aggregates