links icon indicating copy to clipboard operation
links copied to clipboard

Proposals for extending query syntax

Open jamescheney opened this issue 4 years ago • 10 comments

As we work towards adding support for deduplication ("SELECT DISTINCT"), I'd like to propose an extension to the query sublanguage syntax / desugaring.

In SQL one writes

SELECT DISTINCT record
FROM query1 t1,...
WHERE predicate

which looks like this in Links:

for(t1 <- query1,...,) where (predicate) [record]

Note that we have to put explicit list braces around the returned record, and that this means we are also allowed to return any other list of records. In constast, in SQL the syntax hard-wires the common case of returning just one record.

The most direct thing to do in Links is to make a function called something like "distinct" or "dedup" or (pace Haskell) nub) with type [(a)] -e-> [(a)], perhaps also constraining a to be a row of things of base type (handling deduplication of nested records is possible, but somewhat painful.) When called inside a query, this function should have the effect of applying deduplication to the collection passed in as its argument.

Then we'd write

dedup(for (t1 <- query1, ...) where (pred) [record]

While there's nothing inherently wrong with this, it's a little ugly: the fact that records are being deduplicated is mentioned potentially far away from the place where we say what the records are. I'd like to do a little better.

I've brainstormed some (not all mutually exclusive) options:

  1. Add a keyword (e.g. distinct) that can be applied at any stage of a query, which has the effect of deduplicating the immediately enclosing for-comprehension.
for distinct (t1 <- query1, ...) 
where (pred) 
[record]

or

for (t1 <- query1, ...) 
where (pred) 
distinct [record]
  1. Allow the returning keyword as the final step in a query, so that in a query,returning x is just syntactic sugar for [x].
for (t1 <- query1, ...) 
where (pred) 
returning record
  1. Variation on 1, if 2 is adopted: allow distinct ONLY (optionally) after returning.
for (t1 <- query1, ...) 
where (pred) 
returning distinct record

I'd also be OK with adding a new return keyword or changing returning to return (including in returning-insertions) - there isn't a particular reason to keep the SQL keyword if return is clearer in both places. But that would be a breaking change.

Another alternative could be to use yields [distinct] for return-a-singleton-list[-and-deduplicate] instead of return[ing].

Thoughts?

jamescheney avatar May 21 '20 18:05 jamescheney

One complication I just noticed with proposals (1) and (3) is that in the presence of higher-order queries, it may not be immediately obvious where the distinct function should go when desugaring a distinct keyword, except in the special case of (1) where the distinct keyword has to be immediately adjacent to the for whose result is to be deduplicated. So maybe the SQL design choice is just not compatible with higher-order queries. I'll have to chew on this a bit more.

jamescheney avatar May 21 '20 18:05 jamescheney

Additional thoughts:

In Links, an Iteration expression is of the form:

for(gen1,...,genk) 
where (pred)
orderby (orderspec)
exp

where there is a single for-clause which can have arbitrarily many generators, the where and orderby parts are optional,

In concert with #834, we could adjust this syntax to allow (optionally) "return [distinct] exp" syntax, where "return" says "the expression is a list element rather than a list", and "distinct" means in addition the whole iteration (starting with the preceding for) is deduplicated.

This would mean that

for (x <- t1) for (y <- t2) return distinct (a=x.a,b=y.b)

would not be equivalent to

for (x <- t1,y <- t2) return distinct (a=x.a,b=y.b)

but this is a problem we already have, since the orderby clause (if present) will also only apply to the inner for. (Note that the first example gets parsed as two Iterations, the outer one just iterating over t1 and the inner one doing the rest.)

On balance, I think this is not too bad, provided we clearly document the difference between the above two forms. It is of course always still possible to write dedup function calls explicitly to express different behavior.

So my concrete proposal would now be:

  • Allow RETURNING and (optionally) DISTINCT followed by an expression of element type instead of an expression of list type as the last part of an Iteration.
  • If present, RETURNING means each iteration of the for yields a singleton described by the following expression.
  • If present, DISTINCT indicates that (on translation to IR) the whole Iteration is enclosed in a dedup call.

We could also do without the RETURNING and just write DISTINCT exp which means that even if exp returns multiple copies of the same value, they eventually get deduplicated. I find this potential usage somewhat counterintuitive though. I could also live with using YIELDS instead of RETURNING.

Another alternative would be to require DISTINCT to be immediately after the FOR keyword or after the generator list, which might make it clearer that DISTINCT only applies to the innermost FOR, but which I also find less intuitive (because DISTINCT is about the return value, not all the values of the variables bound by the generators)

jamescheney avatar Jun 02 '20 15:06 jamescheney

but this is a problem we already have, since the orderby clause (if present) will also only apply to the inner for.

If I'm not mistaken, the orderby clause is parsed together with the inner for, but it is hoisted to the merged for during normalization, which wouldn't be the case for distinct in your proposal.

My feeling is that it'd be unnatural to return a set in the inner for, and use it to build a bag in the outer for. I like this syntax, but I would want the nested for example to behave exactly the same as the version with a single for. Logically, this would be equivalent to associating distinct with the outer for; but confusingly, the argument of distinct would still be associated with the inner for.

wricciot avatar Jun 03 '20 12:06 wricciot

Ah, interesting; I thought it was the other way around. Is this done at run-time (i.e. if there is an inner orderby in a function that constructs part of a query, is it hoisted out to apply to the top level?) This seems counterintuitive.

jamescheney avatar Jun 03 '20 12:06 jamescheney

Yes. The relevant parts of normalization that deal with hoisting orderby clauses are here and here -- in particular, as you can easily see in the second link, reduce_for_body merges the os of the two fors.

wricciot avatar Jun 03 '20 12:06 wricciot

Btw, I'm not advocating the use of the exact same technique we have in orderby for distinct: that wouldn't work because normalization pushes deduplication down to the leaves of the query, rather than hoisting it.

wricciot avatar Jun 03 '20 12:06 wricciot

I'm not sure it is correct to hoist orderby to the outermost enclosing for. At least for non-database list comprehensions, this changes the results:

links> for (x <- [2,1]) for (y <- [2,1]) orderby (y) [(a=x,b=y)];
[(a = 2, b = 1), (a = 2, b = 2), (a = 1, b = 1), (a = 1, b = 2)] : [(a:Int,b:Int)]
links> for (x <- [2,1],y <- [2,1]) orderby (y) [(a=x,b=y)];
[(a = 2, b = 1), (a = 1, b = 1), (a = 2, b = 2), (a = 1, b = 2)] : [(a:Int,b:Int)]

though perhaps this is acceptable in a query given that in the database the order of traversing x and y is unspecified (and presumably, doing an orderby inside another unordered comprehension can return the results in any order).

Is the semantics of orderby in Links documented anywhere (other than the implementation)? The only mention in the documentation says:

The orderby clause on for comprehensions is used to sort the source before evaluating the body.

which is not quite what SQL does; I believe SQL evaluates the query and then sorts the result. But maybe that difference is the reason for my confusion. (It also isn't really clear what "sort the source" means if there are several tables/subqueries - does this mean sort the cross product of the inputs/subqueries? This could get expensive...)

It might alternatively make sense to consider distinct to deduplicate the bindings seen so far instead of evaluating the query and deduplicating the result, but again that is not what SQL typically does.

jamescheney avatar Jun 04 '20 09:06 jamescheney

I'm not sure it is correct to hoist orderby to the outermost enclosing for. At least for non-database list comprehensions, this changes the results:

Yeah, that's a good point. I can't be 100% sure by looking at my branch of Links only, but orderby seems to have a different semantics on tables compared to lists.

It might alternatively make sense to consider distinct to deduplicate the bindings seen so far instead of evaluating the query and deduplicating the result, but again that is not what SQL typically does.

If you only deduplicate the bindings, your result could still have duplicates, e.g.

for (x <- [1,2], y <- [1,2]) distinct [x+y] (presumably returning [1+1,1+2,2+1,2+2] = [2,3,3,4])

wricciot avatar Jun 04 '20 11:06 wricciot

Interestingly, Kiselyov and Katsushima's APLAS 2018 paper presents a design for incorporating ORDER BY and LIMIT/OFFSET into comprehension query syntax and uses effects to track them. Orderig on fields l1,...,ln has an effect o:[l1,...,ln] and doing LIMIT has effect l:(n,m) where n is the number of records to return and m is the offset. If these constructs are present in a query then they are not performed immediately but collected/delayed until the query is complete (and normalized) and their effect is added on at the end. However, effects in subqueries (i.e. in generators) can be either disregarded (if ordering is applied and then immediately ignored) or ruled out.

  • Performing LIMIT without ORDER BY is statically disallowed because it is nondeterministic (although I think nondeterminism is still possible, if the ordering fields don't include a key).
  • Subqueries performing ORDER BY but not LIMIT are allowed, but the ordering is disregarded - there is no point in sorting a result inside a generator for an unordered query
  • Peforming LIMIT and ORDER BY in a generator subquery is disallowed by the effect system, because it is (supposedly) inefficient; instead a let table construct (which translates to a materalizing WITH common table expression) can be used, documenting that an expensive intermediate table construction may happen.

A similar approach could presumably be used for DISTINCT, where if the DISTINCT keyword is present then it is tracked as an effect, and applied to the query toplevel when the query is generated. I'm not sure this design would be sensible to incorporate into Links, though; it would certainly require some thought regarding how to incorporate the order/limit effect tracking into the rest of the type system consistently.

Nevertheless it's an interesting design and the paper gives an NBE-style implementation strategy that seems pretty close to what Links already does (including for the ordering part), possibly reinventing the implementation described in the appendix of our TLDI 2012 paper, which I'm assuming Oleg hadn't run across...

jamescheney avatar Jun 18 '20 10:06 jamescheney

For reference, here is how distinct, grouping and ordering look in F#'s LINQ syntax (based on computation expressions):

for e in db.table do
...
where (...)
yield ...
distinct

This effectively desugars to something like query |> dedup, i.e. dedup(query).

for x in db.table do
groupBy x.a into g
where (g.Key.HasValue && g.Key.Value > 10)
select (g.Key, g.Count())

This binds g to different groupings which are objects/records with a key and other operations such as count, which can also be considered as a collection (the collection of rows encountered so far with a given key value). The where clause after the groupBy tests the key value, and could also test aggregate values such as counts or sums of the given group, e.g. g.Count() > 1. Other aggregations can be computed from the group, by embedding another query expression as follows:

for x in db.table do
groupBy x.a into g
let total =
    query {
        for y in g do
        sumByNullable y.a
    }
select (g.Key, g.Count(), total)

Though this looks like a nested loop, F# supposedly turns this into a single idiomatic group-by SQL query, however I don't know what algorithm is used or if it is complete. The presence of an additional collection-type variable g which is not a base table seems like a potential obstacle to proving the standard normalization results.

Finally for ordering, this seems to be handled similarly to grouping:

for x in db.table do
sortBy x.n
select student

My impression is that this should be performed as a postprocessing step after the result is produced, rather than ordering the "valuations" generated by the for/where (but maybe it doesn't matter) but I'm not sure how this is translated down to LINQ interface calls or how these are compiled into SQL.

Finally here is a query combining both grouping and ordering:

for x in db.table do
groupBy x.a into g
where (g.Count() > 1)
sortByDescending (g.Count())
select (g.Key, g.Count())

I don't know what happens if the sorting is done before the grouping.

jamescheney avatar Jul 09 '20 14:07 jamescheney