
[TypeDB 3.x] Grouped aggregates and non-terminal reductions

Open cxdorn opened this issue 5 months ago • 7 comments

Problem to Solve

Grouped aggregations are a core data analysis operation.

In its current specification, TypeDB "almost" supports this feature, but it could be made much more convenient.

Current Workaround

For example, consider counting posts from a specific country by tag. Using #7038 we can write:

with fun count_posts($tag: tag, $country: country) -> long:
  match
    $post isa post; 
    location_check($post, $country) == true;
    (post: $post, tag: $tag) isa tagging;
  return count($post);
match
  $country isa country, has name "Australia";
  $post isa post; 
  location_check($post, $country) == true;
  (post: $post, tag: $tag) isa tagging;
select $tag, $country;
distinct;
match
  $count = count_posts($tag, $country);
fetch:
  "tag": $tag.text;
  "post count": $count;

But as this simple example shows, the current workaround requires code duplication: the same match pattern appears both in the function body and in the main query.

Proposed Solution

Extend the role of the reduce operator in pipelines (in particular, make it non-terminal). More specifically:

  • introduce an "as" clause in the reduce statement to name the variables that hold the reduced values for the next stage of the pipeline,
  • introduce a @group($var, ...) annotation to specify the variables by which the incoming stream is grouped before each group is reduced.

With this, the previous example becomes:

match
  $country isa country, has name "Australia";
  $post isa post; 
  location_check($post, $country) == true;
  (post: $post, tag: $tag) isa tagging;
reduce @group($tag) count($post) as $count;
fetch:
  "tag": $tag.text;
  "post count": $count;
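
To make the intended semantics concrete, here is a minimal Python sketch of what a grouped, non-terminal reduce could do to a stream of answers. It is only a model of the proposal, not TypeDB's implementation: the names group_reduce, count and Row are hypothetical, the sample answers are made up, and it assumes count($post) counts the answers in each group rather than distinct posts.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# One answer of the incoming stream: variable name -> bound value.
Row = Dict[str, object]


def group_reduce(
    rows: Iterable[Row],
    group_vars: Tuple[str, ...],
    reducer: Callable[[List[Row]], object],
    as_var: str,
) -> Iterator[Row]:
    """Hypothetical model of `reduce @group(group_vars) reducer as as_var`."""
    # Partition the incoming answers by the values of the grouping variables.
    groups: Dict[tuple, List[Row]] = defaultdict(list)
    for row in rows:
        key = tuple(row[v] for v in group_vars)
        groups[key].append(row)
    # Emit one answer per group: the grouping variables plus the reduced value.
    for key, members in groups.items():
        out: Row = dict(zip(group_vars, key))
        out[as_var] = reducer(members)
        yield out


def count(var: str) -> Callable[[List[Row]], int]:
    """Assumed meaning of count($var): number of answers in the group binding var."""
    return lambda members: sum(1 for m in members if var in m)


# Made-up answers standing in for the output of the match stage.
answers = [
    {"tag": "travel", "post": "p1"},
    {"tag": "travel", "post": "p2"},
    {"tag": "food", "post": "p3"},
]

result = list(group_reduce(answers, ("tag",), count("post"), "count"))
print(result)
# → [{'tag': 'travel', 'count': 2}, {'tag': 'food', 'count': 1}]
```

Because group_reduce yields ordinary answers again (here binding tag and count), a later stage such as fetch can consume them directly, which is the sense in which the reduce becomes non-terminal.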

Additional Information

References:

  • https://duckdb.org/2022/03/07/aggregate-hashtable.html
  • https://www.ibm.com/docs/en/psfa/7.1.0?topic=functions-grouped-aggregates

cxdorn, Sep 10 '24 09:09