Grouping and aggregation (WIP)
This PR will (eventually) add to Links the ability to generate SQL queries with grouping and aggregation. Currently, it merely adds grouping syntax and refactors the normalisation of comprehensions, in the hope that the resulting code will extend more easily to the normalisation of grouping+aggregation queries.
The proposed technique to deal with grouping and aggregation has a very experimental status, so for the time being this PR will be marked as draft.
I'm not sure why this needs to be unsafe. Under the assumption that the second argument is a finite map, isn't

```
fun lookupG(x, l) { for (ab <- l) where (fst(a) == x) b }
```

equivalent?
Ah yes. It can indeed be defined as a safe operation (modulo calls to other unsafe operations (concatMap)).
Actually, I think my reasoning was that the semantics of the two definitions is not the same for inputs that are not valid maps (more than one entry for some keys). But surely we don't care about ill-formed inputs.
Language-integrated query in this PR introduces special treatment for finite maps, particularly those resulting from grouping a relation. The type of finite maps has the form `[(k, [v])]`, where `k` and `v` are (usually?) tuple types.
The following operations are provided in the prelude and can be used in grouping queries:
groupBy;
fun : ((a) -b-> c, [a]) -b-> [(c, [a])]
`groupBy(f, l)` takes a list `l` and a grouping criterion `f`: it applies `f` to each item of the list to obtain its grouping key, and returns a finite map, i.e. a list of pairs associating each grouping key with the list of items of `l` that share that key.
e.g.

```
groupBy(fun (x) { (is_even = even(x)) }, [1, 2, 3])
  = [((is_even = true), [2]), ((is_even = false), [1, 3])]
```
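To make the intended semantics concrete, here is a minimal Python model of `groupBy` (this is an illustrative sketch, not Links code; the snake_case name and list-of-pairs representation of finite maps are assumptions of this model):

```python
def group_by(f, l):
    """Return a finite map: a list of (key, items) pairs, one per distinct key,
    in order of first occurrence of each key."""
    result = []
    for x in l:
        k = f(x)
        for i, (k2, vs) in enumerate(result):
            if k2 == k:
                result[i] = (k2, vs + [x])
                break
        else:
            result.append((k, [x]))
    return result

print(group_by(lambda x: x % 2 == 0, [1, 2, 3]))
# [(False, [1, 3]), (True, [2])]
```

Note that this model orders the keys by first occurrence; the order of keys in the actual SQL-generated result need not match.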
concatMapKey;
fun : ((a) -b-> [c], [(a, [_])]) -b-> [c]
`concatMapKey` works like `concatMap`, but it takes as input a finite map rather than an arbitrary list; it performs comprehension over the (deduplicated) key set of the input map (for this reason, the value type of the finite map argument is irrelevant).
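A Python sketch of this behaviour (again an illustrative model under the list-of-pairs representation of finite maps, not the Links implementation):

```python
def concat_map_key(f, m):
    """Comprehend over the deduplicated key set of the finite map m,
    concatenating the lists produced by f; the values of m are ignored."""
    out = []
    seen = []
    for k, _vs in m:
        if k not in seen:        # deduplicate keys
            seen.append(k)
            out.extend(f(k))
    return out

m = [("a", [1, 2]), ("b", [3]), ("a", [9])]
print(concat_map_key(lambda k: [k.upper()], m))
# ['A', 'B']
```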
lookupG;
fun : (a, [(a, [b])]) -> [b]
Given a key `k` and a finite map `m`, `lookupG(k, m)` returns the list of values that the map `m` associates with the key `k`.
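In the Python model, this is a one-line comprehension, mirroring the "safe" comprehension-based definition discussed earlier in the thread (concatenate the value lists of all matching keys, which coincide when the input is a well-formed map):

```python
def lookup_g(k, m):
    """Return the values associated with key k in the finite map m
    (the empty list if k is absent)."""
    return [v for (k2, vs) in m if k2 == k for v in vs]

m = [("x", [1, 2]), ("y", [3])]
print(lookup_g("x", m))  # [1, 2]
print(lookup_g("z", m))  # []
```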
aggBy;
fun : ([(a, [b])], ([b]) -c-> d) -c-> [(a, d)]
`aggBy(m, f)` takes a finite map `m` and an aggregation criterion `f`: its purpose is to apply the aggregation to the input map on a key-by-key basis.
This is where our LINQ infrastructure gets hacky: we can only support `f` when it is of the form:

```
fun (t) { (outlabel1 = agg1(for (x <- t) [x.inlabel1]), ..., outlabeln = aggn(for (x <- t) [x.inlabeln])) }
```
where `agg1`, ..., `aggn` are certain aggregation functions defined in the prelude:
sum, sumf, avg, avgF, min_list, minF_list, max_list, maxF_list
Any other use of aggBy will NOT be databaseable.
Example queries can be found in the database/grouping.links test file (which requires tables created by database/grouping-create.links)
Thank you @jamescheney. Yes, some cleanup is still in order (mainly disabling debug prints, on top of flagging unused variables).
Sorry, just wanted to see if that was the only reason the CI was failing or if there were other problems.
Modifying it to

```
query nested { for (x <-v- covid_data) for (y <-- subcategory) for (z <-- csv_file) where ... }
```

appears to have resolved the error. I hypothesise that the addition of "nested" means it is no longer using the new mixing code. Should such a query work under the mixing code?
Further experimentation has shown that the interaction of the mixing code and temporal queries leads to the error. For example,

```
query { for (x <-v- covid_data) [vtData(x)] }
```

gives a similar error to the above, and

```
query nested { for (x <-v- covid_data) [vtData(x)] }
```

doesn't.
In the gtopdb code, only `query nested` and `query flat` are used, which possibly explains why the tests succeed with `mixing_norm=true`.
Thanks! The mixing normaliser wasn't exactly handling temporal tables. I have ported Simon's code and now queries such as the ones you mentioned are handled correctly.
This should now be ready to be merged.