haskell-opaleye
haskell-opaleye copied to clipboard
Fix bug preventing `orderedAggregate` and `distinctAggregator` from being used together (alternative approach)
(This is an alternative fix to the bug described in #577 that doesn't require adding Eq
and Ord
constraints to PrimExpr
.)
This PR contains two commits.
The first commit adds a (failing) test case demonstrating the inability to combine orderedAggregate
and distinctAggregator
together, even in the limited cases allowed by PostgreSQL. The problem is that PostgreSQL only allows this if the expressions given in the ORDER BY
clause match the expressions given as arguments to the aggregation function.
The second commit fixes this bug. It does so by adding Eq
and Ord
constraints to Symbol
(but not PrimExpr
) and using them to detect identical references anywhere within the PrimExpr
s contained with the Aggregate
. Instead of renaming the entire expressions, we only rename references to fields (which could possibly be lateral references, though this isn't checked). We maintain a Map
of already-renamed symbols such that identical references will be renamed to the same symbol, fulfilling PostgreSQL's restriction.
This is based on the refactoring in #575.
I think this is my preferred approach but I won't be able to get to it immediately.
Hi @shane-circuithub, please ping me if this is is urgent for you. I'll get to it in due course but I'm short of spare mental cycles at the moment.
It's not necessarily urgent that you merge this, because I can just pin to this commit for now, but I am actually using this. I didn't realise when I first wrote this, but this is actually needed for ordered-set aggregation functions to work. Take the following query for example:
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY column1)
FROM (
VALUES
(1),
(2),
(3),
(4)
) _;
percentile_cont
-----------------
2.5
(1 row)
If you try to recreate that with Opaleye (without this commit), it will produce the following (paraphrased):
SELECT
percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
FROM (
SELECT
0.5 AS inner_1,
column1 AS inner_2
FROM (
VALUES
(1),
(2),
(3),
(4)
) _
) _;
ERROR: column "_.inner_1" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2: percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
^
DETAIL: Direct arguments of an ordered-set aggregate must use only grouped columns.
With this PR, this becomes:
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY inner_1)
FROM (
SELECT
column1 AS inner_1
FROM (
VALUES
(1),
(2),
(3),
(4)
) _
) _;
percentile_cont
-----------------
2.5
(1 row)
With this PR, extractAggregateFields
only rewrites references (e.g., column1
to inner_1
), it doesn't rewrite non-reference expressions (such as 0.5
), which allows Opaleye produces valid SQL for ordered set aggregation functions. This is why I closed #577 in favour if this PR, because #577 wouldn't solve this problem.
The other "valid" way to construct ordered set aggregation functions (which admittedly I care much less about supporting) is with the direct argument in a GROUP BY
clause, e.g.:
SELECT
percentile_cont(column1) WITHIN GROUP (ORDER BY column2)
FROM (
VALUES
(0.5, 1),
(0.5, 2),
(0.5, 3),
(0.5, 4),
(0.8, 5),
(0.8, 6),
(0.8, 7),
(0.8, 8)
) _
GROUP BY
column1;
Without this PR, if you try to write this in Opaleye it will produce the following SQL (again paraphrased):
SELECT
percentile_cont(inner_2) WITHIN GROUP (ORDER BY inner_3)
FROM (
SELECT
column1 AS inner_1,
column1 AS inner_2,
column2 AS inner_3
FROM (
VALUES
(0.5, 1),
(0.5, 2),
(0.5, 3),
(0.5, 4),
(0.8, 5),
(0.8, 6),
(0.8, 7),
(0.8, 8)
) _
) _
GROUP BY
inner_1;
ERROR: column "_.inner_2" must appear in the GROUP BY clause or be used in an aggregate function
LINE 2: percentile_cont(inner_2) WITHIN GROUP (ORDER BY inner_3)
^
DETAIL: Direct arguments of an ordered-set aggregate must use only grouped columns.
Even though inner_1
and inner_2
refer to the same underlying column1
, Postgres isn't smart enough to detect that, it wants the exact same expression in the GROUP BY
clause and the direct argument. However, with this PR, it becomes:
SELECT
percentile_cont(inner_1) WITHIN GROUP (ORDER BY inner_2)
FROM (
SELECT
column1 AS inner_1,
column2 AS inner_2
FROM (
VALUES
(0.5, 1),
(0.5, 2),
(0.5, 3),
(0.5, 4),
(0.8, 5),
(0.8, 6),
(0.8, 7),
(0.8, 8)
) _
) _
GROUP BY
inner_1;
percentile_cont
-----------------
2.5
7.4
Because this PR rewrites only references, and because it keeps track of already-rewritten references and makes sure to only to rename each reference once, both the GROUP BY
and the direct argument to percentile_cont
end up with the same inner_1
expression.
This is not to put pressure on you to deal with this right now if you don't have the capacity (again I'm happy to just pin to this commit in the short-term), but it just explains an additional motivation that I wasn't aware of when I first wrote this.
The first commit adds a (failing) test
Could you double check? That test passes here.
EDIT: No, it fails here (as expected)!
Hmm, and if fails in CI. Maybe a difference between Postgres versions? I'm using 13 locally and CI is using 11.
https://github.com/tomjaguarpaw/haskell-opaleye/actions/runs/6587659165/job/17898358429#step:17:627
Hmm, no it fails in CI in 13 too. I'm really confused why it doesn't fail for me locally!
https://github.com/tomjaguarpaw/haskell-opaleye/actions/runs/6587744501/job/17898614746#step:17:631
No, it does fail locally. Not sure what I was thinking. Never mind.
The error is "in an aggregate with DISTINCT, ORDER BY expressions must appear in argument list", so I think I have to understand that first.
I think we have a bigger problem, because if I slightly change your test case then your fix is no longer sufficient. I appreciate that your fix is enough to make it work with well-formed examples that should work but I'd much rather find a fix that doesn't allow generating incorrect SQL.
@ Test/Test.hs:626 @ testStringArrayAggregateOrderedDistinct :: Test
testStringArrayAggregateOrderedDistinct = it "" $ q `selectShouldReturnSorted` expected
where q =
O.aggregateOrdered
- (O.asc snd)
+ (O.asc (\x -> snd x O..++ snd x))
(PP.p2 (O.arrayAgg, O.distinctAggregator . O.stringAgg . O.sqlString $ ","))
table7Q
expected = [ ( map fst sortedData
@ Test/Test.hs:1506 @ main = do
I can now understand the aggregator DISTINCT
issue (but not the WITHIN GROUP
one). If you have an aggregation of the following form then the expressions p
, q
, r
must each occur literally within the set {a,b,c,...}
. That's a pretty annoying restriction! And it will be hard to enforce statically, but that's how I'd like to solve this particular issue. I'll have more of a think.
FROM (SELECT
AGGREGATOR(DISTINCT a,b,c,... ORDER BY p,q,r, ...) as "result3_3"
I think the correct API for this is probably something like
makeAggregator ::
AggregatorFunction w a b ->
Order a ->
Aggregator a b
makeDistinctAggregator ::
AggregatorFunction w a b ->
Order w ->
Aggregator a b
where the AggregatorFunction
s are like
stringAgg ::
AggregatorFunction
(Wrap (Field SqlText), Wrap (Field SqlText))
(Field SqlText, Field SqlText)
(Field SqlText)
and Wrap
is an opaque wrapper that allows you to reorder Field
s, but doesn't allow you to apply any Field
functions to them. That is, the w
parameter to the AggregatorFunction
is the exact arguments supplied to the SQL aggregation function. So for example, we would have
wrap :: Field a -> Wrap (Field a)
ascWrap :: Order (Wrap (Field a))
asc :: Order (Field a)
asc = contramap wrap ascWrap
I think this would work. It's a bit heavyweight, but actually resolves my preexisting concerns about the semantics of orderAggregate
, aggregateOrdered
and distinctAggregator
.
I've no idea whether the same approach would work for the WITHIN GROUP
problem.