
How to combine data from cube pre-agg with data from a cube without a pre-agg?

Open benskz opened this issue 10 months ago • 2 comments

Problem

We're trying to figure out how to combine data from a pre-aggregated cube with data from a non-pre-aggregated cube. One cube has tons of data, which we'd want accelerated by pre-aggs, but the other is relatively small and can update frequently, so we'd want the most up-to-date data.

Related Cube.js schema

cube('events', {
  sql_table: 'public.events',

  data_source: 'default',

  dimensions: {
    event_id: {
      sql: 'event_id',
      primary_key: true,
      public: false,
      type: 'string',
    },

    timestamp: {
      sql: 'timestamp',
      type: 'time',
    },

    event: {
      sql: 'event',
      type: 'string',
    },

    user_id: {
      sql: 'user_id',
      type: 'string',
    },

    // ...
  },

  measures: {
    clicks: {
      type: 'count',
      filters: [{ sql: `${CUBE}.event = 'click'` }],
    },

    // ...
  },

  joins: {
    users: {
      sql: `${CUBE}.user_id = ${users}.user_id`,
      relationship: 'many_to_one',
    },
  },

  pre_aggregations: {
    daily_events: {
      dimensions: [CUBE.user_id, CUBE.event],
      measures: [CUBE.clicks],
      time_dimension: CUBE.timestamp,
      granularity: 'day',
      partition_granularity: 'month',
    },
  },
});

cube('users', {
  sql_table: 'public.users',

  data_source: 'default',

  dimensions: {
    user_id: {
      sql: 'user_id',
      primary_key: true,
    },

    name: {
      sql: 'name',
      type: 'string',
    },

    // ...
  },
});

In this example, we'd want to query the number of clicks for each user and have the events side accelerated by the pre-aggregation.

However, in its current form, the pre-aggregation is never hit for the event data.
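For reference, the query we're running has roughly this shape (the date range here is illustrative):

```javascript
// Cube JS/REST API query matching the generated SQL below:
// clicks per user name over a time range, ordered by clicks.
const query = {
  measures: ['events.clicks'],
  dimensions: ['users.name'],
  timeDimensions: [
    {
      dimension: 'events.timestamp',
      dateRange: 'last 30 days',
    },
  ],
  order: { 'events.clicks': 'desc' },
  limit: 10000,
};
```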

Related Cube.js generated SQL

SELECT
  "users".name "users__name",
  count(
    distinct CASE
      WHEN ("events".event = 'click') THEN "events".event_id
    END
  ) "events__clicks"
FROM
  public.events AS "events"
  LEFT JOIN public.users AS "users" ON "users".user_id = "events".user_id
WHERE
  (
    "events".timestamp >= $1::timestamptz
    AND "events".timestamp <= $2::timestamptz
  )
GROUP BY
  1
ORDER BY
  2 DESC
LIMIT
  10000

Any help would be hugely appreciated.

benskz avatar May 30 '25 09:05 benskz

@benskz isn't this the case for lambda pre aggregations? we are facing the same problem and this seems to be the implementation we need

willianba avatar Jun 05 '25 20:06 willianba

> @benskz isn't this the case for lambda pre aggregations? we are facing the same problem and this seems to be the implementation we need

I believe lambda pre-aggs are meant for combining pre-aggregated data with new data from the same source. For example, you'd get past data from the pre-agg and then combine it with today's real-time data.

We want to combine a pre-aggregated source with a non-pre-aggregated source. For example, combining pre-aggregated event data with non-pre-aggregated user data.
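For reference, a lambda pre-aggregation for the events cube would look something like this (a sketch building on the daily_events rollup above). It unions pre-built partitions with fresh rows from the same public.events table, which is why it doesn't cover the cross-cube case:

```javascript
pre_aggregations: {
  daily_events: {
    dimensions: [CUBE.user_id, CUBE.event],
    measures: [CUBE.clicks],
    time_dimension: CUBE.timestamp,
    granularity: 'day',
    partition_granularity: 'month',
  },

  // Lambda rollup: serves historical data from daily_events and
  // unions in up-to-date rows straight from the source table.
  daily_events_lambda: {
    type: 'rollup_lambda',
    union_with_source_data: true,
    rollups: [CUBE.daily_events],
  },
},
```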

benskz avatar Jun 07 '25 10:06 benskz

> In this example, we'd want to query the number of clicks for each user and have the events side accelerated by the pre-aggregation.

@benskz Just add dimensions from the users cube to the pre-aggregation so that Cube can serve your queries fully from pre-aggregations.

In principle, combining pre-aggregated data and non-pre-aggregated data is a tough challenge, since the former is stored and queried via Cube Store while the latter is stored and queried by your database.
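Concretely, a sketch of the adjusted pre-aggregation, assuming the schemas above:

```javascript
pre_aggregations: {
  daily_events: {
    // Adding users.name lets Cube serve the clicks-per-user query
    // entirely from Cube Store, without touching public.users.
    dimensions: [CUBE.user_id, users.name, CUBE.event],
    measures: [CUBE.clicks],
    time_dimension: CUBE.timestamp,
    granularity: 'day',
    partition_granularity: 'month',
  },
},
```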

igorlukanin avatar Jun 19 '25 18:06 igorlukanin

@igorlukanin thanks for the response.

The issue is that the users data could change quite frequently. In that case, would you still recommend having it in the pre-agg? That would mean constantly rebuilding pre-aggs, or maybe we're just thinking about this wrong.

benskz avatar Jun 23 '25 14:06 benskz

Well, I believe there's no single right approach here.

One way is to include all necessary dimensions in the pre-aggregation and have it rebuilt as the data changes—you'll basically be throwing more compute at the problem.

Another would be to include only dimensions that do not change (e.g., users.user_id) and join the data at the application level. This consumes less compute but shifts the burden of joining the data downstream.
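A minimal sketch of that second option (the helper name and row shapes are assumptions): fetch clicks keyed by the stable user_id from the pre-aggregated events cube, fetch fresh user rows in a separate query against the users cube, and merge in application code.

```javascript
// Merge rows from two separate Cube queries by user_id.
// clickRows come from the pre-aggregated events cube,
// userRows from a fresh query against the users cube.
function joinClicksWithUsers(clickRows, userRows) {
  const nameById = new Map(userRows.map((u) => [u.user_id, u.name]));
  return clickRows.map((row) => ({
    name: nameById.get(row.user_id) ?? null,
    clicks: row.clicks,
  }));
}

// Example with made-up rows shaped like the two result sets:
const merged = joinClicksWithUsers(
  [{ user_id: 'u1', clicks: 42 }],
  [{ user_id: 'u1', name: 'Ada' }]
);
// merged: [{ name: 'Ada', clicks: 42 }]
```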

A third one would be to place a full-fledged query engine upstream of Cube (e.g., Trino), add something like Airflow to pre-aggregate data the way you want, and then consume it from Cube. Same, but upstream.

There might be more. Personally, I'd probably go with the first one and see how it works before implementing anything more complex. It can work quite well even if you have some slowly changing dimensions (e.g., user name)—if your measures are additive, you can deduplicate the data on the application side anyway.

igorlukanin avatar Jul 01 '25 14:07 igorlukanin

Thanks, we'll explore the different approaches and see what could work for us.

benskz avatar Jul 03 '25 09:07 benskz