Proposals for extending query syntax
As we work towards adding support for deduplication ("SELECT DISTINCT"), I'd like to propose an extension to the query sublanguage syntax / desugaring.
In SQL one writes

```sql
SELECT DISTINCT record
FROM query1 t1, ...
WHERE predicate
```

which looks like this in Links:

```
for (t1 <- query1, ...) where (predicate) [record]
```
Note that we have to put explicit list braces around the returned record, and that this means we are also allowed to return any other list of records. In contrast, in SQL the syntax hard-wires the common case of returning just one record.
The most direct thing to do in Links is to make a function called something like `distinct` or `dedup` or (pace Haskell) `nub`, with type `[(a)] -e-> [(a)]`, perhaps also constraining `a` to be a row of things of base type (handling deduplication of nested records is possible, but somewhat painful). When called inside a query, this function should have the effect of applying deduplication to the collection passed in as its argument.
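As a sketch of the intended semantics (in Python, since the Links function doesn't exist yet; the name `dedup` is one of the candidates above), the operation is an order-preserving deduplication that keeps the first occurrence of each element, analogous to Haskell's `nub`:

```python
def dedup(xs):
    """Order-preserving deduplication: keep the first occurrence
    of each element, analogous to Haskell's `nub`."""
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# Records of base type can be modelled as hashable tuples.
rows = [("a", 1), ("b", 2), ("a", 1)]
assert dedup(rows) == [("a", 1), ("b", 2)]
```

Inside a query, the implementation would of course emit `SELECT DISTINCT` rather than compute this at run time; the sketch only pins down the list-level meaning.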
Then we'd write

```
dedup(for (t1 <- query1, ...) where (pred) [record])
```
While there's nothing inherently wrong with this, it's a little ugly: the fact that records are being deduplicated is mentioned potentially far away from the place where we say what the records are. I'd like to do a little better.
I've brainstormed some (not all mutually exclusive) options:
- Add a keyword (e.g. `distinct`) that can be applied at any stage of a query, which has the effect of deduplicating the immediately enclosing for-comprehension.

```
for distinct (t1 <- query1, ...)
where (pred)
[record]
```

or

```
for (t1 <- query1, ...)
where (pred)
distinct [record]
```
- Allow the `returning` keyword as the final step in a query, so that in a query, `returning x` is just syntactic sugar for `[x]`.

```
for (t1 <- query1, ...)
where (pred)
returning record
```
- Variation on 1, if 2 is adopted: allow `distinct` ONLY (optionally) after `returning`.

```
for (t1 <- query1, ...)
where (pred)
returning distinct record
```
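To make the intended desugaring of option 3 concrete, here is a hypothetical Python rendering (list comprehensions standing in for Links for-comprehensions, and `dedup` for the proposed library function): `returning record` produces one element per iteration, and `distinct` wraps the whole comprehension in a `dedup` call.

```python
def dedup(xs):
    # order-preserving deduplication, as in the proposal
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# hypothetical input table
query1 = [{"a": 1}, {"a": 2}, {"a": 1}]

# `for (t1 <- query1) where (pred) returning distinct t1.a`
# desugars to dedup(for (t1 <- query1) where (pred) [t1.a]):
def q():
    return dedup([t1["a"] for t1 in query1 if t1["a"] > 0])

assert q() == [1, 2]
```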
I'd also be OK with adding a new `return` keyword or changing `returning` to `return` (including in returning-insertions) - there isn't a particular reason to keep the SQL keyword if `return` is clearer in both places. But that would be a breaking change.
Another alternative could be to use `yields [distinct]` for return-a-singleton-list[-and-deduplicate] instead of `return[ing]`.
Thoughts?
One complication I just noticed with proposals (1) and (3) is that in the presence of higher-order queries, it may not be immediately obvious where the `distinct` function should go when desugaring a `distinct` keyword, except in the special case of (1) where the `distinct` keyword has to be immediately adjacent to the `for` whose result is to be deduplicated. So maybe the SQL design choice is just not compatible with higher-order queries. I'll have to chew on this a bit more.
Additional thoughts:
In Links, an `Iteration` expression is of the form:

```
for (gen1, ..., genk)
where (pred)
orderby (orderspec)
exp
```

where there is a single for-clause which can have arbitrarily many generators, and the `where` and `orderby` parts are optional.
In concert with #834, we could adjust this syntax to allow (optionally) `return [distinct] exp` syntax, where `return` says "the expression is a list element rather than a list", and `distinct` means that in addition the whole iteration (starting with the preceding `for`) is deduplicated.
This would mean that

```
for (x <- t1) for (y <- t2) return distinct (a=x.a, b=y.b)
```

would not be equivalent to

```
for (x <- t1, y <- t2) return distinct (a=x.a, b=y.b)
```
but this is a problem we already have, since the `orderby` clause (if present) will also only apply to the inner `for`. (Note that the first example gets parsed as two `Iteration`s, the outer one just iterating over t1 and the inner one doing the rest.)
On balance, I think this is not too bad, provided we clearly document the difference between the above two forms. It is of course always still possible to write `dedup` function calls explicitly to express different behavior.
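The difference between the two forms can be illustrated with a Python sketch (comprehensions standing in for the Links iterations, `dedup` for the proposed order-preserving deduplication): when `distinct` attaches only to the inner iteration, duplicates produced across outer iterations survive.

```python
def dedup(xs):
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

t1 = [1, 1]        # a table with a duplicate row
t2 = [2, 3]

# for (x <- t1) for (y <- t2) return distinct (x, y):
# distinct applies to the *inner* Iteration only.
nested = [p for x in t1 for p in dedup([(x, y) for y in t2])]

# for (x <- t1, y <- t2) return distinct (x, y):
# distinct applies to the whole merged Iteration.
merged = dedup([(x, y) for x in t1 for y in t2])

assert nested == [(1, 2), (1, 3), (1, 2), (1, 3)]  # duplicates survive
assert merged == [(1, 2), (1, 3)]
```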
So my concrete proposal would now be:
- Allow RETURNING and (optionally) DISTINCT followed by an expression of element type instead of an expression of list type as the last part of an `Iteration`.
- If present, RETURNING means each iteration of the for yields a singleton described by the following expression.
- If present, DISTINCT indicates that (on translation to IR) the whole `Iteration` is enclosed in a `dedup` call.
We could also do without the RETURNING and just write `DISTINCT exp`, which means that even if exp returns multiple copies of the same value, they eventually get deduplicated. I find this potential usage somewhat counterintuitive though. I could also live with using YIELDS instead of RETURNING.
Another alternative would be to require DISTINCT to be immediately after the FOR keyword or after the generator list, which might make it clearer that DISTINCT only applies to the innermost FOR, but which I also find less intuitive (because DISTINCT is about the return value, not all the values of the variables bound by the generators).
> but this is a problem we already have, since the `orderby` clause (if present) will also only apply to the inner `for`.
If I'm not mistaken, the `orderby` clause is parsed together with the inner `for`, but it is hoisted to the merged `for` during normalization, which wouldn't be the case for `distinct` in your proposal.
My feeling is that it'd be unnatural to return a set in the inner `for`, and use it to build a bag in the outer `for`. I like this syntax, but I would want the nested for example to behave exactly the same as the version with a single `for`. Logically, this would be equivalent to associating `distinct` with the outer `for`; but confusingly, the argument of `distinct` would still be associated with the inner `for`.
Ah, interesting; I thought it was the other way around. Is this done at run-time (i.e., if there is an inner orderby in a function that constructs part of a query, is it hoisted out to apply to the top level)? This seems counterintuitive.
Yes. The relevant parts of normalization that deal with hoisting orderby clauses are here and here -- in particular, as you can easily see in the second link, `reduce_for_body` merges the `os` of the two fors.
Btw, I'm not advocating the use of the exact same technique we have in `orderby` for `distinct`: that wouldn't work because normalization pushes deduplication down to the leaves of the query, rather than hoisting it.
I'm not sure it is correct to hoist `orderby` to the outermost enclosing `for`. At least for non-database list comprehensions, this changes the results:
```
links> for (x <- [2,1]) for (y <- [2,1]) orderby (y) [(a=x,b=y)];
[(a = 2, b = 1), (a = 2, b = 2), (a = 1, b = 1), (a = 1, b = 2)] : [(a:Int,b:Int)]
links> for (x <- [2,1],y <- [2,1]) orderby (y) [(a=x,b=y)];
[(a = 2, b = 1), (a = 1, b = 1), (a = 2, b = 2), (a = 1, b = 2)] : [(a:Int,b:Int)]
```
though perhaps this is acceptable in a query, given that in the database the order of traversing `x` and `y` is unspecified (and presumably, doing an `orderby` inside another unordered comprehension can return the results in any order).
Is the semantics of `orderby` in Links documented anywhere (other than the implementation)? The only mention in the documentation says:

> The orderby clause on for comprehensions is used to sort the source before evaluating the body.

which is not quite what SQL does; I believe SQL evaluates the query and then sorts the result. But maybe that difference is the reason for my confusion. (It also isn't really clear what "sort the source" means if there are several tables/subqueries - does this mean sort the cross product of the inputs/subqueries? This could get expensive...)
It might alternatively make sense to consider `distinct` to deduplicate the bindings seen so far instead of evaluating the query and deduplicating the result, but again that is not what SQL typically does.
> I'm not sure it is correct to hoist `orderby` to the outermost enclosing `for`. At least for non-database list comprehensions, this changes the results:
Yeah, that's a good point. I can't be 100% sure by looking at my branch of Links only, but `orderby` seems to have a different semantics on tables compared to lists.
> It might alternatively make sense to consider `distinct` to deduplicate the bindings seen so far instead of evaluating the query and deduplicating the result, but again that is not what SQL typically does.
If you only deduplicate the bindings, your result could still have duplicates, e.g.

```
for (x <- [1,2], y <- [1,2]) distinct [x+y]
```

(presumably returning `[1+1,1+2,2+1,2+2] = [2,3,3,4]`).
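This example can be checked directly in Python (a sketch; `dedup` is the hypothetical order-preserving deduplication discussed above): the four bindings are already pairwise distinct, so deduplicating them changes nothing, yet the results still contain a duplicate sum.

```python
from itertools import product

def dedup(xs):
    seen, out = set(), []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

bindings = list(product([1, 2], [1, 2]))
assert dedup(bindings) == bindings  # bindings are already distinct

# ... but the results are not:
results = [x + y for (x, y) in dedup(bindings)]
assert results == [2, 3, 3, 4]

# Deduplicating the *result* is what removes the duplicate 3:
assert dedup(results) == [2, 3, 4]
```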
Interestingly, Kiselyov and Katsushima's APLAS 2018 paper presents a design for incorporating `ORDER BY` and `LIMIT/OFFSET` into comprehension query syntax and uses effects to track them. Ordering on fields `l1,...,ln` has an effect `o:[l1,...,ln]`, and doing `LIMIT` has effect `l:(n,m)`, where n is the number of records to return and m is the offset. If these constructs are present in a query then they are not performed immediately but collected/delayed until the query is complete (and normalized) and their effect is added on at the end. However, effects in subqueries (i.e. in generators) can be either disregarded (if ordering is applied and then immediately ignored) or ruled out.
- Performing `LIMIT` without `ORDER BY` is statically disallowed because it is nondeterministic (although I think nondeterminism is still possible, if the ordering fields don't include a key).
- Subqueries performing `ORDER BY` but not `LIMIT` are allowed, but the ordering is disregarded - there is no point in sorting a result inside a generator for an unordered query.
- Performing `LIMIT` and `ORDER BY` in a generator subquery is disallowed by the effect system, because it is (supposedly) inefficient; instead a `let table` construct (which translates to a materializing `WITH` common table expression) can be used, documenting that an expensive intermediate table construction may happen.
A similar approach could presumably be used for DISTINCT, where if the DISTINCT keyword is present then it is tracked as an effect, and applied to the query toplevel when the query is generated. I'm not sure this design would be sensible to incorporate into Links, though; it would certainly require some thought regarding how to incorporate the order/limit effect tracking into the rest of the type system consistently.
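The collect-then-apply idea can be sketched in Python (all names here are hypothetical; this is not the paper's implementation, just an illustration of delaying the operations): ordering, limiting, and deduplication are recorded as effects while the query is built, checked for the LIMIT-without-ORDER-BY case, and only applied once the query is complete.

```python
class Query:
    """Toy query whose ORDER BY / LIMIT / DISTINCT are collected as
    delayed effects and applied only when the query is run."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.effects = []  # collected, not performed immediately

    def order_by(self, key):
        self.effects.append(("order", key))
        return self

    def limit(self, n, offset=0):
        self.effects.append(("limit", (n, offset)))
        return self

    def distinct(self):
        self.effects.append(("distinct", None))
        return self

    def run(self):
        kinds = [k for k, _ in self.effects]
        # static check from the paper's design: LIMIT needs ORDER BY
        if "limit" in kinds and "order" not in kinds[:kinds.index("limit")]:
            raise ValueError("LIMIT without ORDER BY is disallowed")
        rows = self.rows
        for kind, arg in self.effects:
            if kind == "order":
                rows = sorted(rows, key=arg)
            elif kind == "limit":
                n, off = arg
                rows = rows[off:off + n]
            elif kind == "distinct":
                seen, out = set(), []
                for r in rows:
                    if r not in seen:
                        seen.add(r)
                        out.append(r)
                rows = out
        return rows

assert Query([3, 1, 2, 1]).order_by(lambda x: x).limit(2).run() == [1, 1]
assert Query([3, 1, 2, 1]).distinct().order_by(lambda x: x).run() == [1, 2, 3]
```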
Nevertheless it's an interesting design and the paper gives an NBE-style implementation strategy that seems pretty close to what Links already does (including for the ordering part), possibly reinventing the implementation described in the appendix of our TLDI 2012 paper, which I'm assuming Oleg hadn't run across...
For reference, here is how distinct, grouping and ordering look in F#'s LINQ syntax (based on computation expressions):
```fsharp
for e in db.table do
    ...
    where (...)
    yield ...
    distinct
```
This effectively desugars to something like `query |> dedup`, i.e. `dedup(query)`.
```fsharp
for x in db.table do
    groupBy x.a into g
    where (g.Key.HasValue && g.Key.Value > 10)
    select (g.Key, g.Count())
```
This binds g to different groupings, which are objects/records with a key and other operations such as count, and which can also be considered as a collection (the collection of rows encountered so far with a given key value). The where clause after the groupBy tests the key value, and could also test aggregate values such as counts or sums of the given group, e.g. `g.Count() > 1`. Other aggregations can be computed from the group, by embedding another query expression as follows:
```fsharp
for x in db.table do
    groupBy x.a into g
    let total =
        query {
            for y in g do
            sumByNullable y.a
        }
    select (g.Key, g.Count(), total)
```
Though this looks like a nested loop, F# supposedly turns this into a single idiomatic group-by SQL query; however, I don't know what algorithm is used or if it is complete. The presence of an additional collection-type variable `g` which is not a base table seems like a potential obstacle to proving the standard normalization results.
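A Python sketch of the groupBy-into-g semantics (hypothetical rendering, not F#'s actual translation): `g` becomes a key paired with the rows sharing that key, and both the count and the embedded sum aggregation are computed per group.

```python
from collections import defaultdict

# hypothetical table with a duplicated key
table = [{"a": 1}, {"a": 1}, {"a": 2}]

# groupBy x.a into g: g ranges over (key, rows-with-that-key)
groups = defaultdict(list)
for x in table:
    groups[x["a"]].append(x)

# select (g.Key, g.Count(), total) where total sums y.a over the group
result = [(key, len(rows), sum(y["a"] for y in rows))
          for key, rows in groups.items()]

assert result == [(1, 2, 2), (2, 1, 2)]
```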
Finally for ordering, this seems to be handled similarly to grouping:
```fsharp
for x in db.table do
    sortBy x.n
    select student
```
My impression is that this should be performed as a postprocessing step after the result is produced, rather than ordering the "valuations" generated by the for/where (but maybe it doesn't matter), but I'm not sure how this is translated down to LINQ interface calls or how these are compiled into SQL.
Finally here is a query combining both grouping and ordering:
```fsharp
for x in db.table do
    groupBy x.a into g
    where (g.Count() > 1)
    sortByDescending (g.Count())
    select (g.Key, g.Count())
```
I don't know what happens if the sorting is done before the grouping.
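Reading the clauses in the written order gives one plausible semantics, sketched here in Python (hypothetical; whether F# actually evaluates in this order is exactly the open question above): group first, then filter groups, then sort descending by count, then select.

```python
from collections import Counter

table = [{"a": 1}, {"a": 1}, {"a": 2}, {"a": 3}, {"a": 3}, {"a": 3}]

# groupBy x.a into g, keeping only the per-group counts
counts = Counter(x["a"] for x in table)

# where (g.Count() > 1), sortByDescending (g.Count()),
# select (g.Key, g.Count())
result = sorted(((k, c) for k, c in counts.items() if c > 1),
                key=lambda kc: -kc[1])

assert result == [(3, 3), (1, 2)]
```

If sorting happened before grouping instead, the sort key `g.Count()` would not even be defined, which suggests the group-then-sort order is the only coherent reading.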