Options to avoid losing values in collect_map
The docs for the collect_map aggregate function disclose:
If collect_map receives multiple values for the same key, the last value received is retained.
This may be fine for many use cases, but we may want to offer options that preserve all values.
Details
At the time this issue is being opened, super is at commit 55d99d3.
To illustrate, we'll use this example input prices.json that's similar to the data used in the collect_map docs.
{"stock":"IBM","price":46.67}
{"stock":"APPL","price":150.13}
{"stock":"GOOG","price":87.07}
{"stock":"APPL","price":150.13}
{"stock":"GOOG","price":89.15}
By simply applying collect_map, we get the final price value seen in the input stream for each unique stock ticker.
$ super -version
Version: v1.18.0-222-g55d99d3b
$ super -Z -c 'collect_map(|{stock:price}|)' prices.json
|{
"IBM": 46.67,
"APPL": 150.13,
"GOOG": 89.15
}|
This "last value wins" behavior may be familiar to many users. For instance, ChatGPT suggested the following jq command line to combine the same stream of objects into a single JSON object.
$ jq -s 'reduce .[] as $item ({}; .[$item.stock] = $item.price)' prices.json
{
"IBM": 46.67,
"APPL": 150.13,
"GOOG": 89.15
}
However, @mccanne recently pointed out that silently dropping values may not be ideal, especially since SuperDB provides complex data types that could easily hold all of them. Values could be stored as a set if the user wants to keep each unique value, or as an array if they want to keep every value (including repeats) in the order encountered in the stream.
This can already be achieved using existing building blocks, e.g., first invoking union in a separate step to create sets:
$ super -Z -c 'price:=union(price) by stock | collect_map(|{stock:price}|)' prices.json
|{
"IBM": |[
46.67
]|,
"APPL": |[
150.13
]|,
"GOOG": |[
87.07,
89.15
]|
}|
Or collect to create arrays:
$ super -Z -c 'price:=collect(price) by stock | collect_map(|{stock:price}|)' prices.json
|{
"IBM": [
46.67
],
"APPL": [
150.13,
150.13
],
"GOOG": [
87.07,
89.15
]
}|
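To make the three behaviors easier to compare side by side, here is a minimal Python sketch of the semantics, not of how super implements them. The function name collect_map_sketch and its mode values ("last", "union", "collect") are hypothetical and only mirror the outputs shown above.

```python
import json

# Same records as prices.json above, inlined so the sketch is self-contained.
PRICES_JSON = """\
{"stock":"IBM","price":46.67}
{"stock":"APPL","price":150.13}
{"stock":"GOOG","price":87.07}
{"stock":"APPL","price":150.13}
{"stock":"GOOG","price":89.15}
"""


def collect_map_sketch(records, mode="last"):
    """Hypothetical model of three ways to handle repeated keys.

    mode="last":    keep only the last value seen (today's collect_map).
    mode="union":   keep each unique value per key, as a set.
    mode="collect": keep every value per key, in arrival order.
    """
    result = {}
    for rec in records:
        key, value = rec["stock"], rec["price"]
        if mode == "last":
            result[key] = value
        elif mode == "union":
            result.setdefault(key, set()).add(value)
        elif mode == "collect":
            result.setdefault(key, []).append(value)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return result


records = [json.loads(line) for line in PRICES_JSON.splitlines()]
last_wins = collect_map_sketch(records, "last")
as_sets = collect_map_sketch(records, "union")
as_arrays = collect_map_sketch(records, "collect")
```

Only the "collect" mode shows that APPL appeared twice with the same price. The "union" mode collapses the repeat, and "last" hides that GOOG ever traded at 87.07.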
If we expose this kind of functionality as a change to the default behavior and/or as new options for collect_map, we'd also need to decide whether values are wrapped in the complex type even when a key has a single value (as in the examples above that use existing building blocks) or only when multiple values are observed for a key.
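The wrapping question can be sketched in Python as well. The helper wrap_values and its always_wrap flag are hypothetical names used only for illustration, and the input is the array-valued result from the collect example.

```python
def wrap_values(collected, always_wrap=True):
    """Hypothetical model of the two wrapping policies.

    collected maps each key to the list of values seen for it, in order.
    always_wrap=True:  every key maps to a list, even for a single value.
    always_wrap=False: a key with one value maps to the bare value, and
                       only keys with multiple values map to a list.
    """
    if always_wrap:
        return {key: list(values) for key, values in collected.items()}
    return {
        key: (values[0] if len(values) == 1 else list(values))
        for key, values in collected.items()
    }


# Values per ticker as produced by the collect-based example.
collected = {
    "IBM": [46.67],
    "APPL": [150.13, 150.13],
    "GOOG": [87.07, 89.15],
}

uniform = wrap_values(collected, always_wrap=True)
mixed = wrap_values(collected, always_wrap=False)
```

With always_wrap=False, IBM maps to a bare 46.67 while the other tickers map to lists, so anything consuming the result has to handle both shapes. Wrapping every value keeps one shape for all keys, at the cost of extra nesting for keys that only ever had one value.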