ditto icon indicating copy to clipboard operation
ditto copied to clipboard

Provide "aggregation" functions in Ditto search

Open thjaeckle opened this issue 2 years ago • 4 comments

Utilising the MongoDB search index, Ditto could support aggregation queries via its search API, e.g. in order to find out the mean temperature value of hundreds/thousands of selected things.

Our used RQL syntax also provides "aggregations":

aggregate(<property|function>,...) - Aggregates the array, grouping by objects that are distinct for the provided properties, and then reduces the remaining other property values using the provided functions

..

sum(<property?>) - Finds the sum of every value in the array or if the property argument is provided, returns the sum of the value of property for every object in the array mean(<property?>) - Finds the mean of every value in the array or if the property argument is provided, returns the mean of the value of property for every object in the array max(<property?>) - Finds the maximum of every value in the array or if the property argument is provided, returns the maximum of the value of property for every object in the array min(<property?>) - Finds the minimum of every value in the array or if the property argument is provided, returns the minimum of the value of property for every object in the array recurse(<property?>) - Recursively searches, looking in children of the object as objects in arrays in the given property value

If any adopter of Ditto already had a need for such kind of aggregations, please provide feedback and your use-cases for such a feature.

thjaeckle avatar Oct 28 '22 11:10 thjaeckle

Hi @thjaeckle, I recently thought about this feature. We have a clear need to have aggregations of important thing states to manage a large fleet of devices. To have an overview of groups of devices we need to aggregate the connectivity or health or update state of the devices within a group efficiently. Current solution is to replicate the important states in a relational db and update the replica based on a Ditto Kafka connection. Aggregation is then done efficiently in the RDB (home turf for that). But not good to having need for two databases.

So if there is an easy way to use aggregation capabilities of the MongoDB this would be beneficial.

  • aggregations should be possible based on the values of an attribute (e.g. "groupId")
  • policies of the things should be considered for aggregation (only aggregate value that the subject is allowed to see)

thfries avatar Apr 02 '23 16:04 thfries

@thfries very interesting, thx for the input

aggregations should be possible based on the values of an attribute (e.g. "groupId")

You mean that eg the "groupId" would be used as "group by" in an aggregation? But having another specified field?

From your comment I assume that you would need a "distinct count" aggregation? I can currently not imagine how to declare such a aggregation with the RQL syntax.

For other queries and interpreting the sparse RQL documentation (and the lack of a specified count aggregation), I would assume that e.g. a "count" aggregation could look like:

GET /api/2/search/things?filter=aggregate(attributes/groupId,count())

With a response being e.g.:

{
  "items": [
    {
      "my-group-1": 42
    },
    {
      "my-group-2": 7
    }
  ], 
... 
}

Another example to find out the mean temperature of a room with devices providing temperature readings:

GET /api/2/search/things?filter=aggregate(attributes/roomId,mean(features/temperature/properties/currentReading))

With a response being e.g.:

{
  "items": [
    {
      "kitchen": 21.32
    },
    {
      "living": 22.03
    }
  ],
... 
}

thjaeckle avatar Apr 02 '23 18:04 thjaeckle

Thanks for your examples. Sorry for not being precise here. You're right, the use case is not only about aggregation but also about grouping. The full example in SQL would be:

select room, connectivity, count(*) from thing group by room, connectivity

with a response being:

room connectivity count(*)
kitchen online 3
living offline 1
living online 4

So the interesting part is to get a statistic overview of your fleet. May be using a SQL db for that purpose is a valid solution. But I know relational databases much better than RQL or MongoDB...

thfries avatar Apr 03 '23 08:04 thfries

I try to find a syntax which would support your query..

GET /api/2/search/things?filter=aggregate(attributes/roomId,features/connectivity/properties/state,count())

With a response being e.g.:

{
  "items": [
    {
      "attributes/roomId": "kitchen", 
      "features/connectivity/properties/state": "online", 
      "count()": 3
    },
    {
      "attributes/roomId": "living", 
      "features/connectivity/properties/state": "offline", 
      "count()": 1
    },
    {
      "attributes/roomId": "living", 
      "features/connectivity/properties/state": "online", 
      "count()": 4
    }
  ],
... 
}

Thinkable would also be sorting aggregations:

GET /api/2/search/things?filter=aggregate(attributes/roomId,features/connectivity/properties/state,count())&sort(-count())

Translated to MongoDB this query could be (I did not verify yet):

db.search.aggregate([
    {$group : {_id:{"attributes/roomId":"$t.attributes.roomId", "features/connectivity/properties/state":"$t.features.connectivity.properties.state"}, "count()":{$count:{}}}},
    {$sort : {value:-1}}
])

When I find the time, I will try those aggregation queries in MongoDB. 😉

thjaeckle avatar Apr 03 '23 18:04 thjaeckle