cube icon indicating copy to clipboard operation
cube copied to clipboard

Using FILTER_PARAMS causes filters to be duplicated in WHERE clauses

Open tchell opened this issue 3 years ago β€’ 9 comments

Describe the bug When using FILTER_PARAMS to filter a cube, the attribute that is being filtered gets added to the SQL twice.

To Reproduce

// cube.js
module.exports = {
  queryRewrite: (query, { ... }) => {
    query.filters.push({
      member: 'Order.closed',
      operator: 'equals',
      values: ['true'],
    });
    return query;
  }
};

// schema/Order.js
cube('Order, {
  sql: `SELECT * FROM order WHERE ${FILTER_PARAMS.Order.closed.filter('closed')}`,
  measures: {
    amount: {
      sql: 'amount',
      type: `sum`,
    },
  dimensions: {
    closed: {
      type: `boolean`,
      sql: `closed`,
   },
};

// query
{
  measures: 'Order.amount',
}

Generates the following sql:

SELECT sum(order.amount)
FROM
  (
    SELECT *
    FROM order
    WHERE order.closed = $1
  )
WHERE
  order.closed = $2

Expected behavior

SELECT sum(order.amount
    FROM order
    WHERE order.closed = $1

OR (at least)

SELECT sum(order.amount)
FROM
  (
    SELECT *
    FROM order
    WHERE order.closed = $1
  )

Version: "@cubejs-backend/postgres-driver": "^0.28.19", "@cubejs-backend/server": "^0.28.18", "@cubejs-backend/server-core": "^0.28.18"

tchell avatar Sep 22 '21 20:09 tchell

@tchell Hey Tanner! Could you please elaborate on your use case? It seems the example is oversimplified here.

paveltiunov avatar Sep 25 '21 06:09 paveltiunov

OK. I'm not totally sure what more there is to expand on. I'm pretty sure my example is following almost exactly how the docs indicates. And it seems like the sql generation is incorrect, since the filter is being done twice. Although the example provided is not exactly my use case I think it is sufficient to explain the error is it not?

My use case is that I'm trying to filter a cube by first joining with another that identifies if access is allowed to a given item in the cube:

cube('Order', {
  sql: `SELECT order.*, user.email as user_email
          FROM order JOIN user ON order.user_id = user.id
          WHERE ${FILTER_PARAMS.Order.user_email.filter('user.email')}`,
  ...
  dimensions: {
    user_email: {
      type: `string`,
      sql: `user_email`,
    },
  ....
}

// query
{
  measures: 'Order.amount',
  filters: [
    {
      member: 'Order.user_email',
      operator: 'equals',
      values: ['[email protected]'],
    },
}

In this case the SQL generated looks like:

SELECT sum(order_filtered.amount)
FROM
  (
    SELECT order.*, user.email as user_email
    FROM order ON order.user_id = user.id
    WHERE user.email = $1
  ) as order_filtered
WHERE order_filtered.user_email = $2

So the filter is being applied to the query twice when it should ideally only be used once. Is this clear enough? Sorry if not, I'm honestly not really sure what more information you need here.

tchell avatar Sep 27 '21 18:09 tchell

@tchell Hey Tanner! This join detail makes much more sense. And why this outer filter bothers you? Is it because of performance concerns?

paveltiunov avatar Sep 28 '21 01:09 paveltiunov

Yeah, realistically it shouldn't have any different results. Likely the only problem would be performance related (if that even). But to me it just seems like an odd thing that would ideally be removed since it's unnecessary and confusing when you see it.

tchell avatar Sep 28 '21 19:09 tchell

@tchell That makes sense. It's for sure redundant but most databases just push down it and execute only once. For those which doesn't you most probably should use pre-aggregations if performance is a concern. We also have an experimental rewriteQueries feature that parses SQL and pushes down filters by rewriting SQL. It can be extended to meet this requirement of not having duplicated filter.

paveltiunov avatar Oct 21 '21 06:10 paveltiunov

If you are interested in working on this issue, please leave a comment below and we will be happy to assign the issue to you. If this is the first time you are contributing a Pull Request to Cube.js, please check our contribution guidelines. You can also post any questions while contributing in the #contributors channel in the Cube.js Slack.

github-actions[bot] avatar Oct 21 '21 06:10 github-actions[bot]

Just wanted to follow up with something new that might make this a bit more important. If are joining the cube with another this will cause potentially undesirable behaviour. Every join in cubejs is done with a left join to allow null values to be included in the results. However, this behaviour does not allow that to happen because the second added clause will remove any null values. I'd argue this is undesirable because it seems like the point of using FILTER_PARAMS in the cube sql definition is to only filter the underlying cube before any joins and then the normal query behaviour should I can try working on this if I can find some time. But if anyone reading this happens to know how to discover what the FILTER_PARAMS are I'd appreciate it, since in my initial look through I haven't seen the obvious way yet...

tchell avatar Mar 25 '22 19:03 tchell

Related: https://github.com/cube-js/cube.js/issues/640

tomsseisums avatar Dec 08 '22 20:12 tomsseisums

Hi. Is there any method in cube configuration to remove the duplicate where statement as stated in this issue? Facing the same duplicate where clause statement as well.

absoluteharam avatar Jan 19 '24 04:01 absoluteharam

Is there currently a way to remove these duplicate clauses and optimize the query?

ChrisLahaye avatar Feb 20 '24 09:02 ChrisLahaye

@ChrisLahaye Are you sure that the generated query is not optimal? Could you please provide more details?

As @paveltiunov mentioned in his comment above, most databases and data warehouses push the outer filter down and perform filtering only onceβ€”so, there's no actual performance hit or "query deoptimization" happening.

igorlukanin avatar Feb 20 '24 10:02 igorlukanin

Hey @igorlukanin, we are using Cubejs on top of Clickhouse which has a postgresql connection in it(in the ex, it's the facts db). To optimize the Cubes that has static dimensions and metrics which our dashboard tables uses, we had written something like this;

SELECT
  "apm".*,
  "stats".*,
  ${mockTime(FILTER_PARAMS, 'advertiserProductMetrics')} as time
  FROM facts.advertiser_product_metrics "apm"
    LEFT JOIN (
      ${advertiserProductMetricSubquery(
        FILTER_PARAMS,
        'advertiserProductMetrics',
        ['campaign_id', 'product_resource_id']
      )}
    ) AS "stats" ON toInt64 ("apm".campaign_id) = toInt64 ("stats".campaign_id) AND "apm".resource_id = "stats".product_resource_id

There is this line in the advertiserProductMetricSubquery

WHERE ${filter_params[viewName].time.filter('time')}

And this helps us filter the events in the Clickhouse table before they are joined with postgresql tables. When using subqueries like this, pushdown from Clickhouse side is not really optimal. But to make that work we had to add this mockTime function which always needs to yield to true for the given time filter on Clickhouse. So we wrote this;

exports.mockTime = (filter_params, viewName) => {
  if (filter_params[viewName].time.filter('time') == '1 = 1') {
    return `1`
  }
  return `addSeconds(${filter_params[viewName].time.filter(
    (from, to) => 'parseDateTimeBestEffort(' + from + ')'
  )}, 1)`
}

Which is broken with the following commit; https://github.com/cube-js/cube/commit/736070d5a33c25d212420a61da6c98a4b4d31188

So if there was an option to disable duplicated WHERE clause, there would be no need to add such a hideous function πŸ˜…

casab avatar Feb 21 '24 09:02 casab

Is there currently a way to remove these duplicate clauses, in my case

select 
        accountId,
        count(distinct uuid) interactions,
        toDate(${SQL_UTILS.convertTz('created_datetime')}) createdDate
        from my_table
        WHERE
            ${FILTER_PARAMS.Dataset.createdDatetime.filter(
                'created_datetime',
            )} AND
            ${FILTER_PARAMS.Dataset.eventType.filter(
                'event_type',
            )} AND
            ${FILTER_PARAMS.Dataset.channel.filter('channel')} 
            
        group by 
            accountId, createdDate

In this case I really dont want to apply the second filter, because there is no such columns exists after the group by that I extended from the other cube.

sufiyangorgias avatar Feb 23 '24 15:02 sufiyangorgias

@sufiyangorgias The way I do it is to wrap this in a subquery, and create mock columns that will always yield the duplicated filters to TRUE. For example you can write something like

SELECT
    sq.*,
    ${FILTER_PARAMS.Dataset.eventType.filter('event_type')} as event_type,
    ${FILTER_PARAMS.Dataset.createdDatetime.filter('created_datetime')} as created_datetime
    ${FILTER_PARAMS.Dataset.channel.filter('channel')}  as channel
FROM
    (your query as subquery) as sq

But cubejs receives timezone, and converts the date and filters to relevant timezone. I don't think you need to use that convertTz.

casab avatar Feb 25 '24 13:02 casab

Hey @casab Haha, that's a clever workaround! Thanks a bunch for the tip, I'll be sure to give it a try. You're a real SQL ninja! πŸ˜„πŸ™Œ.

Regarding this

But cubejs receives timezone, and converts the date and filters to relevant timezone. I don't think you need to use that convertTz.

I can't remove the timezone conversion because it affects the date's accuracy due to my setup. However, I'm interested in preserving the granularity feature provided by the time dimension filters. Do you know Is there a way to bypass the parent timezone conversion while still utilizing these time dimension filters? Your insights would be greatly appreciated.

sufiyangorgias avatar Feb 25 '24 16:02 sufiyangorgias

Is there any news or plans on this? At least give us the ability to delete it from filters after use.

casab avatar Mar 19 '24 14:03 casab