haskell-opaleye Fix bug preventing `orderedAggregate` and `distinctAggregator` from being used together (alternative approach)

(This is an alternative fix to the bug described in #577 that doesn't require adding Eq and Ord constraints to PrimExpr.)

This PR contains two commits.

The first commit adds a (failing) test case demonstrating the inability to combine orderedAggregate and distinctAggregator together, even in the limited cases allowed by PostgreSQL. The problem is that PostgreSQL only allows this if the expressions given in the ORDER BY clause match the expressions given as arguments to the aggregation function.

The second commit fixes this bug. It does so by adding Eq and Ord constraints to Symbol (but not PrimExpr) and using them to detect identical references anywhere within the PrimExprs contained with the Aggregate. Instead of renaming the entire expressions, we only rename references to fields (which could possibly be lateral references, though this isn't checked). We maintain a Map of already-renamed symbols such that identical references will be renamed to the same symbol, fulfilling PostgreSQL's restriction.

Oct 05 '23 16:10 shane-circuithub

This is based on the refactoring in #575.

Oct 05 '23 16:10 shane-circuithub

I think this is my preferred approach but I won't be able to get to it immediately.

Oct 05 '23 19:10 tomjaguarpaw

Hi @shane-circuithub, please ping me if this is is urgent for you. I'll get to it in due course but I'm short of spare mental cycles at the moment.

Oct 14 '23 17:10 tomjaguarpaw

It's not necessarily urgent that you merge this, because I can just pin to this commit for now, but I am actually using this. I didn't realise when I first wrote this, but this is actually needed for ordered-set aggregation functions to work. Take the following query for example:

SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY column1)
FROM (
  VALUES
    (1),
    (2),
    (3),
    (4)
) _;

 percentile_cont 
-----------------
             2.5
(1 row)

If you try to recreate that with Opaleye (without this commit), it will produce the following (paraphrased):

SELECT
  percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
FROM (
  SELECT
    0.5 AS inner_1,
    column1 AS inner_2
  FROM (
    VALUES
      (1),
      (2),
      (3),
      (4)
  ) _
) _;

ERROR:  column "_.inner_1" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2:   percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
                          ^
DETAIL:  Direct arguments of an ordered-set aggregate must use only grouped columns.

With this PR, this becomes:

SELECT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY inner_1)
FROM (
  SELECT
    column1 AS inner_1
  FROM (
    VALUES
      (1),
      (2),
      (3),
      (4)
  ) _
) _;

 percentile_cont 
-----------------
             2.5
(1 row)

With this PR, extractAggregateFields only rewrites references (e.g., column1 to inner_1), it doesn't rewrite non-reference expressions (such as 0.5), which allows Opaleye produces valid SQL for ordered set aggregation functions. This is why I closed #577 in favour if this PR, because #577 wouldn't solve this problem.

The other "valid" way to construct ordered set aggregation functions (which admittedly I care much less about supporting) is with the direct argument in a GROUP BY clause, e.g.:

SELECT
  percentile_cont(column1) WITHIN GROUP (ORDER BY column2)
FROM (
  VALUES
    (0.5, 1),
    (0.5, 2),
    (0.5, 3),
    (0.5, 4),
    (0.8, 5),
    (0.8, 6),
    (0.8, 7),
    (0.8, 8)
) _
GROUP BY
  column1;

Without this PR, if you try to write this in Opaleye it will produce the following SQL (again paraphrased):

SELECT
  percentile_cont(inner_2) WITHIN GROUP (ORDER BY inner_3)
FROM (
  SELECT
    column1 AS inner_1,
    column1 AS inner_2,
    column2 AS inner_3
  FROM (
    VALUES
      (0.5, 1),
      (0.5, 2),
      (0.5, 3),
      (0.5, 4),
      (0.8, 5),
      (0.8, 6),
      (0.8, 7),
      (0.8, 8)
  ) _
) _
GROUP BY
  inner_1;

ERROR:  column "_.inner_2" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2:   percentile_cont(inner_2) WITHIN GROUP (ORDER BY inner_3)
                          ^
DETAIL:  Direct arguments of an ordered-set aggregate must use only grouped columns.

Even though inner_1 and inner_2 refer to the same underlying column1, Postgres isn't smart enough to detect that, it wants the exact same expression in the GROUP BY clause and the direct argument. However, with this PR, it becomes:

SELECT
  percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
FROM (
  SELECT
    column1 AS inner_1,
    column2 AS inner_2
  FROM (
    VALUES
      (0.5, 1),
      (0.5, 2),
      (0.5, 3),
      (0.5, 4),
      (0.8, 5),
      (0.8, 6),
      (0.8, 7),
      (0.8, 8)
  ) _
) _
GROUP BY
  inner_1;

 percentile_cont 
-----------------
             2.5
             7.4

Because this PR rewrites only references, and because it keeps track of already-rewritten references and makes sure to only to rename each reference once, both the GROUP BY and the direct argument to percentile_cont end up with the same inner_1 expression.

This is not to put pressure on you to deal with this right now if you don't have the capacity (again I'm happy to just pin to this commit in the short-term), but it just explains an additional motivation that I wasn't aware of when I first wrote this.

Oct 16 '23 07:10 shane-circuithub

The first commit adds a (failing) test

Could you double check? That test passes here.

EDIT: No, it fails here (as expected)!

Oct 20 '23 12:10 tomjaguarpaw

Hmm, and if fails in CI. Maybe a difference between Postgres versions? I'm using 13 locally and CI is using 11.

https://github.com/tomjaguarpaw/haskell-opaleye/actions/runs/6587659165/job/17898358429#step:17:627

Oct 20 '23 12:10 tomjaguarpaw

Hmm, no it fails in CI in 13 too. I'm really confused why it doesn't fail for me locally!

https://github.com/tomjaguarpaw/haskell-opaleye/actions/runs/6587744501/job/17898614746#step:17:631

Oct 20 '23 12:10 tomjaguarpaw

No, it does fail locally. Not sure what I was thinking. Never mind.

Oct 20 '23 12:10 tomjaguarpaw

The error is "in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list", so I think I have to understand that first.

Oct 20 '23 12:10 tomjaguarpaw

I think we have a bigger problem, because if I slightly change your test case then your fix is no longer sufficient. I appreciate that your fix is enough to make it work with well-formed examples that should work but I'd much rather find a fix that doesn't allow generating incorrect SQL.

@ Test/Test.hs:626 @ testStringArrayAggregateOrderedDistinct :: Test
 testStringArrayAggregateOrderedDistinct = it "" $ q `selectShouldReturnSorted` expected
   where q =
           O.aggregateOrdered
-            (O.asc snd)
+            (O.asc (\x -> snd x O..++ snd x))
             (PP.p2 (O.arrayAgg, O.distinctAggregator . O.stringAgg . O.sqlString $ ","))
             table7Q
         expected = [ ( map fst sortedData
@ Test/Test.hs:1506 @ main = do

Oct 27 '23 09:10 tomjaguarpaw

I can now understand the aggregator DISTINCT issue (but not the WITHIN GROUP one). If you have an aggregation of the following form then the expressions p, q, r must each occur literally within the set {a,b,c,...}. That's a pretty annoying restriction! And it will be hard to enforce statically, but that's how I'd like to solve this particular issue. I'll have more of a think.

FROM (SELECT
     AGGREGATOR(DISTINCT a,b,c,... ORDER BY p,q,r, ...) as "result3_3"

Oct 27 '23 10:10 tomjaguarpaw

I think the correct API for this is probably something like

makeAggregator ::
  AggregatorFunction w a b ->
  Order a ->
  Aggregator a b

makeDistinctAggregator ::
  AggregatorFunction w a b ->
  Order w ->
  Aggregator a b

where the AggregatorFunctions are like

stringAgg ::
  AggregatorFunction
    (Wrap (Field SqlText), Wrap (Field SqlText))
    (Field SqlText, Field SqlText)
    (Field SqlText)

and Wrap is an opaque wrapper that allows you to reorder Fields, but doesn't allow you to apply any Field functions to them. That is, the w parameter to the AggregatorFunction is the exact arguments supplied to the SQL aggregation function. So for example, we would have

wrap :: Field a -> Wrap (Field a)

ascWrap :: Order (Wrap (Field a))

asc :: Order (Field a)
asc = contramap wrap ascWrap

I think this would work. It's a bit heavyweight, but actually resolves my preexisting concerns about the semantics of orderAggregate, aggregateOrdered and distinctAggregator.

I've no idea whether the same approach would work for the WITHIN GROUP problem.

Oct 27 '23 11:10 tomjaguarpaw

haskell-opaleye haskell-opaleye copied to clipboard

Fix bug preventing `orderedAggregate` and `distinctAggregator` from being used together (alternative approach)

haskell-opaleye
haskell-opaleye copied to clipboard