prql
prql copied to clipboard
Simpler grouping without aggregation?
This is currently invalid PRQL
from employees
group [city]
But the following is valid
from employees
group [city] (
aggregate []
)
Can we make a group without aggregation valid PRQL? Possibly even making the above example the default pipeline if none is specified?
I realise you can achieve the same result via
from employees
select city
group [city] (
take 1
)
but that is more clunky.
That being said an empty aggregate list has a few edges cases that need resolving (fixing the underlying or an error message disallowing it).
All of the below examples generate invalid SQL:
from employees
group [city] (
aggregate []
take 1
)
from employees
group [city] (
aggregate []
take ..
)
from employees
group [city] (
aggregate []
sort true
)
I haven't though about this, but you are completely right: empty group should just group the rows.
But I don't agree that it should be equivalent to take 1. Do you have any justification for that?
I would rather make your first example equivalent to:
from employees
sort hash(city)
I'm using hash here just to say "the order of cities is not important, just make sure that same cities are together". In SQL, we could just identity for the hash and translate to:
SELECT * FROM employees ORDER BY city
@aljazerzen Having thought about it some more, I think you are right.
The clunkiness of my last example won't be so bad when #944 is resolved and it can just be
from employees
group [city] (
take 1
)
The main benefit of defaulting to 'take 1' by default is making group and SQL's GROUP BY more consistent with each-other. However, nothing prevents someone bolting on a group_by keyword far far down the line with a different default - so proceeding as you describe gets a 👍 from me.
Actually, after thinking a bit more about this, I believe that:
-
this should be an error:
from employees | group [city]... saying
missing argument for function `group` -
this should be an error:
from employees | group [city] ()... saying
expected a `function relation -> relation`, got `null`... or evensyntax error: expected an expression found () -
this should group rows by
city:func id rel -> rel from employees | group [city] id
- this should be an error:
from employees | group [city]
Agreed. Unless we wanted to translate this into the following SQL but I'm not sure whether you can make group take zero or one transforms as an argument:
SELECT DISTINCT city FROM employees
- this should group rows by
city:func id rel -> rel from employees | group [city] id
I'm not sure what "group rows by city" means in this context.
On a purely logical/theoretical level, the group function splits the dataset by city and then the argument transform gets to process a data subset of just all records for a particular city. I think of it in the way the data is processed by for example pandas, which is well described here Group by: split-apply-combine.
So if you then apply the identity transform to that group data then you would just get back the original dataset, perhaps with the records reordered so that all records for the same city are contiguous. Therefore I think this should possibly translate to the following SQL:
SELECT *
FROM employees
ORDER BY city
- this should be an error:
from employees | group [city] ()
I would argue that the empty transform () should be the same as the identity transform func id rel -> rel in the sense that each pipeline is followed by an empty transform which leaves the dataset unchanged.
Therefore this should produce the same output as the previous case.
Trying some of these out in the playground I found a number of errors. Should I report these under #944 or create a separate issue?
I'm not sure what "group rows by city" means in this context.
Reorder rows such that rows with the same cities are next to each other.
I would argue that the empty transform () should be the same as the identity transform
I'd like this to be true for this case, but produces really weird behavior:
() # this identity i.e. func a -> a
(x) # this is just value x
(x | y) # this is function call (y x)
Generalizing single element pipelines to a plain value makes sense: there is no functions to apply this value to, so it's just the value.
I cannot come up with a justification like that for zero-element pipeline, so I'd rather leave this a syntax error. Things that are probably a mistake should ideally be forbidden.
Trying some of these out in the playground I found a number of errors. Should I report these under https://github.com/PRQL/prql/issues/944 or create a separate issue?
Whichever is easier.