prql icon indicating copy to clipboard operation
prql copied to clipboard

Simpler grouping without aggregation?

Open BlurrechDev opened this issue 2 years ago • 7 comments

This is currently invalid PRQL

from employees
group [city]

But the following is valid

from employees
group [city] (
  aggregate []
)

Can we make a group without aggregation valid PRQL? Possibly even making the above example the default pipeline if none is specified?

I realise you can achieve the same result via

from employees
select city
group [city] (
  take 1
)

but that is more clunky.

BlurrechDev avatar Jan 07 '23 17:01 BlurrechDev

That being said an empty aggregate list has a few edges cases that need resolving (fixing the underlying or an error message disallowing it).

All of the below examples generate invalid SQL:

from employees
group [city] (
  aggregate []
  take 1
)
from employees
group [city] (
  aggregate []
  take ..
)
from employees
group [city] (
  aggregate []
  sort true
)

BlurrechDev avatar Jan 07 '23 17:01 BlurrechDev

I haven't though about this, but you are completely right: empty group should just group the rows.

But I don't agree that it should be equivalent to take 1. Do you have any justification for that?

I would rather make your first example equivalent to:

from employees
sort hash(city)

I'm using hash here just to say "the order of cities is not important, just make sure that same cities are together". In SQL, we could just identity for the hash and translate to:

SELECT * FROM employees ORDER BY city

aljazerzen avatar Jan 07 '23 17:01 aljazerzen

@aljazerzen Having thought about it some more, I think you are right.

The clunkiness of my last example won't be so bad when #944 is resolved and it can just be

from employees
group [city] (
  take 1
)

The main benefit of defaulting to 'take 1' by default is making group and SQL's GROUP BY more consistent with each-other. However, nothing prevents someone bolting on a group_by keyword far far down the line with a different default - so proceeding as you describe gets a 👍 from me.

BlurrechDev avatar Jan 07 '23 18:01 BlurrechDev

Actually, after thinking a bit more about this, I believe that:

  • this should be an error:

    from employees | group [city]
    

    ... saying missing argument for function `group`

  • this should be an error:

    from employees | group [city] ()
    

    ... saying expected a `function relation -> relation`, got `null` ... or even syntax error: expected an expression found ()

  • this should group rows by city:

    func id rel -> rel
    from employees | group [city] id
    

aljazerzen avatar Jan 09 '23 09:01 aljazerzen

  • this should be an error:
    from employees | group [city]
    

Agreed. Unless we wanted to translate this into the following SQL but I'm not sure whether you can make group take zero or one transforms as an argument:

SELECT DISTINCT city FROM employees

  • this should group rows by city:
    func id rel -> rel
    from employees | group [city] id
    

I'm not sure what "group rows by city" means in this context.

On a purely logical/theoretical level, the group function splits the dataset by city and then the argument transform gets to process a data subset of just all records for a particular city. I think of it in the way the data is processed by for example pandas, which is well described here Group by: split-apply-combine.

So if you then apply the identity transform to that group data then you would just get back the original dataset, perhaps with the records reordered so that all records for the same city are contiguous. Therefore I think this should possibly translate to the following SQL:

SELECT *
FROM employees
ORDER BY city

  • this should be an error:
    from employees | group [city] ()
    

I would argue that the empty transform () should be the same as the identity transform func id rel -> rel in the sense that each pipeline is followed by an empty transform which leaves the dataset unchanged.

Therefore this should produce the same output as the previous case.


Trying some of these out in the playground I found a number of errors. Should I report these under #944 or create a separate issue?

snth avatar Jan 09 '23 14:01 snth

I'm not sure what "group rows by city" means in this context.

Reorder rows such that rows with the same cities are next to each other.


I would argue that the empty transform () should be the same as the identity transform

I'd like this to be true for this case, but produces really weird behavior:

()      # this identity i.e. func a -> a
(x)     # this is just value x
(x | y) # this is function call (y x) 

Generalizing single element pipelines to a plain value makes sense: there is no functions to apply this value to, so it's just the value.

I cannot come up with a justification like that for zero-element pipeline, so I'd rather leave this a syntax error. Things that are probably a mistake should ideally be forbidden.

aljazerzen avatar Jan 09 '23 14:01 aljazerzen

Trying some of these out in the playground I found a number of errors. Should I report these under https://github.com/PRQL/prql/issues/944 or create a separate issue?

Whichever is easier.

aljazerzen avatar Jan 09 '23 14:01 aljazerzen