links icon indicating copy to clipboard operation
links copied to clipboard

Grouping and aggregation (WIP)

Open wricciot opened this issue 3 years ago • 2 comments

This PR will (eventually) add to Links the ability to generate SQL queries with grouping and aggregation. Currently, it merely adds grouping syntax and refactors the normalisation of comprehensions, in the hope that the resulting code will be extended more easily to the normalization of grouping+aggregation queries.

The proposed technique to deal with grouping and aggregation has a very experimental status, so for the time being this PR will be marked as draft.

wricciot avatar Apr 25 '22 11:04 wricciot

I'm not sure why this needs to be unsafe. Under the assumption that the second argument is a finite map, isn't

fun lookupG(x,l) {
  for (ab <- l) where (fst(a) == x) b
}

equivalent?

Ah yes. It can indeed be defined as a safe operation (modulo calls to other unsafe operations (concatMap)).

wricciot avatar May 11 '22 09:05 wricciot

Ah yes. It can indeed be defined as a safe operation (modulo calls to other unsafe operations (concatMap)).

Actually, I think my reasoning was that the semantics of the two definitions is not the same for inputs that are not valid maps (more than one entry for some keys). But surely we don't care about ill-formed inputs.

wricciot avatar May 11 '22 09:05 wricciot

Language-integrated query in this PR introduces a special treatment for finite maps, particularly those resulting from grouping a relation. The type of finite maps is of the form [(k, [v])], where k and v are (usually?) tuple types.

The following operations are provided in the prelude and can be used in grouping queries:

groupBy;
fun : ((a) -b-> c, [a]) -b-> [(c, [a])]

groupBy(f, l) takes a list l and a grouping criterion f: it applies f to each item in the list to obtain its grouping key, and finally returns a finite map, i.e. a list of pairs associating to each grouping key the list of items of the original list l that share the same grouping key.

e.g. groupBy(fun (x) { (is_even = even(x)) }, [1, 2, 3]) = [((is_even = true), [2]), ((is_even = false), [1,3])]

concatMapKey;
fun : ((a) -b-> [c], [(a, [_])]) -b-> [c]

concatMapKey works like concatMap, but it takes as input a finite map rather than any list; it performs comprehension over the (deduplicated) key set of the input map (for this reason, the output type of the finite map argument is irrelevant).

lookupG;
fun : (a, [(a, [b])]) -> [b]

Given a key k and a finite map m, lookupG(k,m) returns the list of values associated by the map m to the key k.

aggBy;
fun : ([(a, [b])], ([b]) -c-> d) -c-> [(a, d)]

aggBy(m, f) takes a finite map m and an aggregation criterion f: its purpose is to apply the aggregation to the input map on a key by key basis This is where our LINQ infrastructure gets hacky: we can only support f when it is in the form: fun (t) { (outlabel1 = agg1(for (x <- t) [x.inlabel1]), ..., outlabeln = aggn(for (x <- t) [x.inlabeln])) }

Where agg1, ... aggn are certain aggregation functions defined in the prelude: sum, sumf, avg, avgF, min_list, minF_list, max_list, maxF_list

Any other use of aggBy will NOT be databaseable.

Example queries can be found in the database/grouping.links test file (which requires tables created by database/grouping-create.links)

wricciot avatar Jun 26 '23 14:06 wricciot

Thank you @jamescheney. Yes some cleanup is still in order (mainly disabling debug prints, on top of flagging unused variables).

wricciot avatar Jun 27 '23 09:06 wricciot

Sorry, just wanted to see if that was the only reason the CI was failing or if there were other problems.

jamescheney avatar Jun 27 '23 11:06 jamescheney

Modifying it to

query nested {for (x <-v- covid_data) for (y <-- subcategory) for (z <-- csv_file) where ... }

appears to have resolved the error. I hypothesise that the addition of "nested" means it is no longer using the new mixing code. Should such a query work under the mixing code?

Further experimentation has shown that the interaction of the mixing code and temporal queries leads to the error. For example,

query { for ( x <-v- covid_data ) [vtData(x)]

gives a similar error to above and

query nested { for ( x <-v- covid_data ) [vtData(x)]

doesn't.

vcgalpin avatar Aug 30 '23 13:08 vcgalpin

In the gtopdb code, only query nested and query flat are used, which possibly explains the success of the tests with mixing_norm=true

vcgalpin avatar Aug 30 '23 13:08 vcgalpin

Modifying it to

query nested {for (x <-v- covid_data) for (y <-- subcategory) for (z <-- csv_file) where ... }

appears to have resolved the error. I hypothesise that the addition of "nested" means it is no longer using the new mixing code. Should such a query work under the mixing code?

Further experimentation has shown that the interaction of the mixing code and temporal queries leads to the error. For example,

query { for ( x <-v- covid_data ) [vtData(x)]

gives a similar error to above and

query nested { for ( x <-v- covid_data ) [vtData(x)]

doesn't.

Thanks! The mixing normaliser wasn't exactly handling temporal tables. I have ported Simon's code and now queries such as the ones you mentioned are handled correctly.

wricciot avatar Oct 02 '23 22:10 wricciot

This should now be ready to be merged.

wricciot avatar Oct 02 '23 23:10 wricciot