prql icon indicating copy to clipboard operation
prql copied to clipboard

`any` & `all` functions

Open strokirk opened this issue 2 years ago • 5 comments

It would be super useful to have some sort of functionality which would allow us to combine multiple OR queries more easily, especially allowing them to be formatted into multiple lines.

Here is a simple suggestion for how such a couple of helpers look:

filter any [
  foo == 1,
  all [
    bar == 4,
    any [
       xuul == 2,
       foo == 2,
    ],
  ],
]

which would compile to

SELECT id
FROM tbl
WHERE foo = 1 OR (bar = 4 AND (xuul = 2 OR foo = 2))

strokirk avatar Feb 24 '23 22:02 strokirk

These would be functions added to the compiler any & all which take a list. Because they iterate over the list, they do require to be in the compiler. This would be a great early contribution, but would be an ambitious initial contribution.

(Though if we enabled jinja in s-strings, which is probably a terrible idea, then this could be in s-strings...)

(I'll adjust the title if that's OK @strokirk )

max-sixty avatar Feb 24 '23 22:02 max-sixty

@max-sixty please, edit away! :D

The key points I find important are:

  • Each line should be able to be commented out easily
  • Nesting is needed
  • Preferably a consistent combination character, for example trailing comma

strokirk avatar Feb 24 '23 23:02 strokirk

The key points I find important are:

Yes great, we'd get those things by using a normal PRQL list. The work here is to add the any & all functions which take a list

max-sixty avatar Feb 25 '23 00:02 max-sixty

@aljazerzen I see the "work in progree" label. Are you still working on it?

PrettyWood avatar Dec 08 '23 08:12 PrettyWood

I'm not.

This was partially implemented in #2940 (and a followup PR), so the following is now supported:

from customers
derive has_fax = (fax != null)
group {country, city} (
  aggregate {
    is_ancient = all has_fax
  }
)
group {country} (
  aggregate {
    has_ancient_city = (any is_ancient)
  }
)
aggregate {
  do_all_countries_have_ancient_cities = (all has_ancient_city)
}

std.all and std.any are aggregation functions that expect an array of bools. But they currently don't work with array literals.

I see a few ways of implementing this:

  1. Make a special case backend implementation for std.all and std.any, where we look if the argument is an array literal and unpack the literal into OR and AND, as proposed by the original comment.

    Would work, but this approach is hard to scale onto all other aggregation functions.

  2. Materialize RQ arrays into SQL arrays: (ARRAY [] or equivalent in other dialects) The problem is that some dialects don't support arrays at all (SQLite3) and that even those that do, don't have an easy way to implement any([false, true, false]).

    On Postgres, it looks like this: select TRUE = ANY (ARRAY[TRUE, FALSE, TRUE]);

  3. Materialize RQ arrays into SQL relations with a single column. For example: [false, true, false] would compile into:

    (SELECT FALSE UNION ALL SELECT TRUE UNION ALL SELECT FALSE)
    

    The idea here is that this is now a plain relation, that we can use with normal aggregation functions. We'd define the column name as some fixed string i.e. arr_col:

    SELECT ...
    FROM ...
    WHERE (
      SELECT BOOL_OR(arr_col) FROM (...the query from above...) arr(arr_col)
    )
    

    This is a general solution, as it would also work with:

    • things that are not array literals, but do have the array type (a column that has been aggregated with array_agg):
      from customers
      group {country} (
        aggregate {
          customer_ids = array_agg customer_id
        }
      )
      select {
        country,
        sum_of_ids = sum customer_ids
      }
      
    • other aggregation functions, for example:
      filter (sum [1, my_col, 3]) > 10
      

If we go with the last option, this requires a bit more work upfront to define this process of array relational-ization. It think it would pay off, as we could also support arbitrary array operations:

from customers
aggregate {
    cnt_has_fax = (fax | filter this != null | count),
    cnt_names_that_start_with_a = (first_name | filter this[0] == 'A' | count)
}

aljazerzen avatar Dec 08 '23 09:12 aljazerzen