How to combine data from cube pre-agg with data from a cube without a pre-agg?
Problem
We're trying to figure out how to combine data from a pre-aggregated cube with data from a cube that has no pre-aggregation. One cube holds a large volume of data that we'd like accelerated by pre-aggregations, while the other is relatively small but updates frequently, so we want its most up-to-date data.
Related Cube.js schema
```javascript
cube('events', {
  sql_table: 'public.events',
  data_source: 'default',

  dimensions: {
    event_id: {
      sql: 'event_id',
      primary_key: true,
      public: false,
      type: 'string',
    },
    timestamp: {
      sql: 'timestamp',
      type: 'time',
    },
    event: {
      sql: 'event',
      type: 'string',
    },
    user_id: {
      sql: 'user_id',
      type: 'string',
    },
    // ...
  },

  measures: {
    clicks: {
      type: 'count',
      filters: [{ sql: `${CUBE}.event = 'click'` }],
    },
    // ...
  },

  joins: {
    users: {
      sql: `${users}.user_id = ${events}.user_id`,
      relationship: 'many_to_one',
    },
  },

  pre_aggregations: {
    daily_events: {
      dimensions: [CUBE.user_id, CUBE.event],
      measures: [CUBE.clicks],
      time_dimension: CUBE.timestamp,
      granularity: 'day',
      partition_granularity: 'month',
    },
  },
});
```
```javascript
cube('users', {
  sql_table: 'public.users',
  data_source: 'default',

  dimensions: {
    user_id: {
      sql: 'user_id',
      type: 'string',
      primary_key: true,
    },
    name: {
      sql: 'name',
      type: 'string',
    },
    // ...
  },
});
```
In this example, we'd want to query the number of clicks for each user and have the events side accelerated by the pre-aggregation.
However, in its current form, the pre-aggregation is never hit for the event data.
Related Cube.js generated SQL
```sql
SELECT
  "users".name "users__name",
  count(
    distinct CASE
      WHEN ("events".event = 'click') THEN "events".event_id
    END
  ) "events__clicks"
FROM
  public.events AS "events"
  LEFT JOIN public.users AS "users" ON "users".user_id = "events".user_id
WHERE
  (
    "events".timestamp >= $1::timestamptz
    AND "events".timestamp <= $2::timestamptz
  )
GROUP BY
  1
ORDER BY
  2 DESC
LIMIT
  10000
```
Any help would be hugely appreciated.
@benskz isn't this the use case for lambda pre-aggregations? We are facing the same problem, and this seems to be the implementation we need.
I believe lambda pre-aggregations are meant for combining pre-aggregated data with new data from the same source. For example, you'd get historical data from the pre-aggregation and combine it with today's real-time data.
We want to combine a pre-aggregated source with a non-pre-aggregated source. For example, combining pre-aggregated event data with non-pre-aggregated user data.
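For context, this is roughly what a lambda pre-aggregation looks like: it unions a batch rollup with fresh rows from the source table of the *same* cube. A sketch (the `daily_events_lambda` name is illustrative, and this is untested configuration):

```javascript
// Inside the events cube: a lambda pre-aggregation that serves historical
// data from the daily_events rollup and appends real-time rows from the
// source table. Note both sides come from the same cube.
pre_aggregations: {
  daily_events: {
    // ... existing rollup definition as above ...
  },
  daily_events_lambda: {
    type: 'rollup_lambda',
    union_with_source_data: true, // append fresh rows from the source table
    rollups: [CUBE.daily_events],
  },
},
```

As noted, this doesn't address joining a pre-aggregated cube with a separate, non-pre-aggregated cube.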
In this example, we'd want to query the number of clicks for each user and have the events side accelerated by the pre-aggregation.
@benskz Just add dimensions from the users cube to the pre-aggregation so that Cube can serve your queries fully from pre-aggregations.
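Concretely, that would mean extending the pre-aggregation to include the joined dimension, along these lines (a sketch based on the schema above; untested):

```javascript
// In the events cube: adding users.name lets Cube match queries that
// group by user name entirely against the pre-aggregation.
pre_aggregations: {
  daily_events: {
    dimensions: [CUBE.user_id, CUBE.event, users.name],
    measures: [CUBE.clicks],
    time_dimension: CUBE.timestamp,
    granularity: 'day',
    partition_granularity: 'month',
  },
},
```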
In principle, combining pre-aggregated data and non-pre-aggregated data is a tough challenge, since the former is stored and queried via Cube Store while the latter is stored and queried by your database.
@igorlukanin thanks for the response.
The issue is that the users data can change quite frequently. In that case, would you still recommend having it in the pre-aggregation? That would mean constantly rebuilding pre-aggregations, or maybe we're just thinking about this wrong.
Well, I believe there's no single right approach here.
One way is to include all necessary dimensions in the pre-aggregation and have it rebuilt as the data changes; you'd basically be throwing more compute at the problem.
Another would be to include only dimensions that do not change (e.g., users.user_id) and join the data at the application level. This consumes less compute but shifts the burden of joining the data downstream.
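The application-level join could look like the sketch below: one accelerated query returns clicks keyed by user_id, a second fresh query returns user records, and the app stitches them together. The row shapes mimic Cube's flattened `cube.member` response keys; the fetching itself (two `load` calls) is out of scope here, and all names are illustrative.

```javascript
// Join pre-aggregated click counts (from the accelerated `events` query)
// with fresh user rows (from a separate, non-pre-aggregated `users` query).
function joinClicksWithUsers(clickRows, userRows) {
  // Index users by their primary key for O(1) lookups.
  const usersById = new Map(userRows.map((u) => [u['users.user_id'], u]));
  return clickRows.map((row) => {
    const user = usersById.get(row['events.user_id']);
    return {
      userId: row['events.user_id'],
      // Fall back to null when the user record is missing (left-join semantics).
      name: user ? user['users.name'] : null,
      clicks: row['events.clicks'],
    };
  });
}

// Example shapes, mirroring Cube's flattened response keys:
const clickRows = [
  { 'events.user_id': 'u1', 'events.clicks': 42 },
  { 'events.user_id': 'u2', 'events.clicks': 7 },
];
const userRows = [{ 'users.user_id': 'u1', 'users.name': 'Ada' }];

console.log(joinClicksWithUsers(clickRows, userRows));
// → [ { userId: 'u1', name: 'Ada', clicks: 42 },
//     { userId: 'u2', name: null, clicks: 7 } ]
```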
A third one would be to place a full-fledged query engine upstream of Cube (e.g., Trino), add something like Airflow to pre-aggregate data the way you want, and then consume it from Cube. Same, but upstream.
There might be more. Personally, I'd probably go with the first one and see how it works before implementing anything more complex. It can work quite well even if you have some slowly changing dimensions (e.g., user name); if your measures are additive, you can deduplicate the data on the application side anyway.
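The deduplication mentioned for slowly changing dimensions could be a sketch like this: if a user's name changed mid-period, the rollup may contain rows for both names under one user_id, and since the measure is additive you can re-aggregate client-side (assumes rows arrive oldest-first; names are illustrative):

```javascript
// Re-aggregate additive measures by user_id so a renamed user
// does not appear twice in the result.
function dedupeByUserId(rows) {
  const byUser = new Map();
  for (const row of rows) {
    const prev = byUser.get(row.userId);
    if (prev) {
      // Additive measure: safe to sum across duplicate dimension rows.
      prev.clicks += row.clicks;
      prev.name = row.name; // keep the latest name (rows assumed oldest-first)
    } else {
      byUser.set(row.userId, { ...row });
    }
  }
  return [...byUser.values()];
}

const rows = [
  { userId: 'u1', name: 'A. Lovelace', clicks: 10 },
  { userId: 'u1', name: 'Ada Lovelace', clicks: 5 }, // name changed mid-period
  { userId: 'u2', name: 'Grace Hopper', clicks: 3 },
];

console.log(dedupeByUserId(rows));
// → [ { userId: 'u1', name: 'Ada Lovelace', clicks: 15 },
//     { userId: 'u2', name: 'Grace Hopper', clicks: 3 } ]
```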
Thanks, we'll explore the different approaches and see what could work for us.