Support grouping by multiple columns in Morphir SDK Aggregations

Open jonathanmaw opened this issue 2 years ago • 3 comments

Morphir.SDK.Aggregate.groupBy takes an argument called getKey to specify the columns to group by. This can be either:

  • a single fieldFunction (implemented)
  • a tuple of keys constructed with the keyN function (from Morphir.SDK.Key), where N is between 2 and 16 (not yet implemented).

Implementing this will involve:

  • Create an example that groups by multiple keys.
    • This will likely also involve working out what the key passed to the lambda we give to aggregate (i.e. \key inputs ->) actually is in that case, and how to use it.
    • In Spark, grouping by multiple columns causes the columns to be repeated in the output as separate columns. Is the same true in Morphir SDK Aggregations?
  • Change the AggregationCall type to store a List of Names for the Group Key, and probably a List of Maybe Names for the Returned Group Key.
  • Change the constructAggregationCall function to parse the keyFields successfully as either a single field or a keyN (this may require changing the circumstances under which we restrict the number of keyFields).
  • Extend the Aggregate ObjectExpression to take a List of Strings for its columns to group by, with corresponding changes in objectExpressionFromAggregationCall and mapObjectExpressionToScala, and Morphir.Spark.API.aggregate
  • Write tests to cover the new example
  • Update documentation to describe what's now supported and what the output looks like.
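
As a rough illustration of the AggregationCall change listed above, here is a minimal sketch of how the two key fields might look once they hold lists. Everything in it is an assumption made for illustration: the module name, the GroupKeys record, the singleKey and multiKey helpers, and the local Name alias (a stand-in for Morphir.IR.Name.Name) are hypothetical and are not the real definitions in the codebase.

```elm
module AggregationCallSketch exposing (GroupKeys, Name, multiKey, singleKey)

-- Stand-in for Morphir.IR.Name.Name (assumption, defined locally so the
-- sketch compiles on its own).


type alias Name =
    List String



-- Hypothetical shape of the key-related fields after the change:
-- one entry per grouped column, in the order given to keyN.


type alias GroupKeys =
    { groupKey : List Name
    , returnedGroupKey : List (Maybe Name)
    }



-- The existing single-column case becomes a one-element list.


singleKey : Name -> GroupKeys
singleKey name =
    { groupKey = [ name ]
    , returnedGroupKey = [ Nothing ]
    }



-- The keyN case: one group key per column. Returned names are left as
-- Nothing here because the returned group key is not handled yet.


multiKey : List Name -> GroupKeys
multiKey names =
    { groupKey = names
    , returnedGroupKey = List.map (always Nothing) names
    }
```

Keeping the two lists the same length means each returned name can be matched to its group key by position, which is one possible design and not necessarily the one the implementation should take.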

jonathanmaw, Aug 25 '22 11:08

To expand on this a bit:

The first step is to familiarise oneself with how keyN works in groupBy and aggregate.

For a list of Antiques named 'source':

source
    |> groupBy (key2 .category .product)
    |> aggregate
        (\key inputs ->
            ...
        )

Find out what 'key' is, and how to create named columns from it.

For example, with input data like

category            product     ageofItem   ...
HouseHoldCollection Furniture   10.0        ...
HouseHoldCollection Furniture   12.0        ...
HouseHoldCollection Plates      20.0        ...
HouseHoldCollection Plates      22.0        ...
PaintCollections    Paintings   50.0        ...
PaintCollections    Paintings   52.0        ...

How do you get

category            product     oldest
HouseHoldCollection Furniture   12.0
HouseHoldCollection Plates      22.0
PaintCollections    Paintings   52.0
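
To make the intended result concrete, here is a minimal plain-Elm sketch of the semantics, written without the Morphir SDK. It is not how groupBy and aggregate are implemented: the local key2 and groupByKey functions are hypothetical stand-ins, the Antique record is cut down to three fields, and the field name ageOfItem is an assumption. It only shows one plausible reading, where the composite key is a tuple that the lambda can destructure into named columns.

```elm
module MultiKeySketch exposing (Antique, oldestByCategoryAndProduct)

import Dict exposing (Dict)


type alias Antique =
    { category : String
    , product : String
    , ageOfItem : Float
    }



-- Hypothetical stand-in for keyN with N = 2: builds a composite key as a tuple.


key2 : (a -> k1) -> (a -> k2) -> a -> ( k1, k2 )
key2 getKey1 getKey2 a =
    ( getKey1 a, getKey2 a )



-- Hypothetical stand-in for groupBy: collects the rows that share a key.


groupByKey : (a -> comparable) -> List a -> Dict comparable (List a)
groupByKey getKey items =
    List.foldr
        (\item acc ->
            Dict.update (getKey item)
                (\existing -> Just (item :: Maybe.withDefault [] existing))
                acc
        )
        Dict.empty
        items



-- One output row per (category, product) pair, carrying the largest ageOfItem.


oldestByCategoryAndProduct : List Antique -> List { category : String, product : String, oldest : Float }
oldestByCategoryAndProduct source =
    source
        |> groupByKey (key2 .category .product)
        |> Dict.toList
        |> List.filterMap
            (\( ( category, product ), inputs ) ->
                inputs
                    |> List.map .ageOfItem
                    |> List.maximum
                    |> Maybe.map
                        (\oldest ->
                            { category = category
                            , product = product
                            , oldest = oldest
                            }
                        )
            )
```

Applied to the six input rows above, this returns the three rows of the target table, in key order. Whether the SDK hands the lambda a tuple in the same way is exactly what still needs to be confirmed.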

Existing examples that make use of groupBy and aggregate can be found in tests-integration/spark/model/src/SparkTests/AggregationTests.elm.

jonathanmaw, Aug 31 '22 15:08

AggregationCall and constructAggregationCall can be found in src/Morphir/SDK/Aggregate.elm.

The purpose of "group key" vs. "returned group key" is explained in https://github.com/finos/morphir-elm/issues/799#issuecomment-1191714998. The gist is that in Elm, when you do

        testDataSet
            |> groupBy .key1
            |> aggregate
                (\key inputs ->
                    { key = key
                    , count = inputs (count |> withFilter (\a -> a.value < 7))
                    , sum = inputs (sumOf .value)
                    , max = inputs (maximumOf .value)
                    , min = inputs (minimumOf .value)
                    }
                )

that creates a field named "key" from a field that was named "key1". The group key is "key1", while the returned group key is "key". This issue is about handling multiple group keys. We currently ignore the returned group key (see #842 for the task to implement it).

jonathanmaw, Aug 31 '22 16:08

I created a PR with all my work so far on this at https://github.com/finos/morphir-elm/pull/911

jonathanmaw, Oct 17 '22 09:10