feat: support collection types

Open kerinin opened this issue 1 year ago • 5 comments

Most data formats support collection types such as maps and lists, including Parquet, JSON, Avro, and Protobuf. We should support collections as well.

Collections bring a number of benefits:

  • More complete support of existing data types, for example Parquet
  • Minimizes the need for pre-processing data sources that include collections
  • Enables a number of valuable data operations, for example aggregation within an event and aggregation across entities (e.g. "what is the most common product category").

As a first step, Kaskada can support simple singleton methods:

let first_foo = Table.list_of_foos[0]
let bar = Table.map_of_foos["bar"]

This allows users to access specific elements within a collection, without introducing the complexities of operating across all elements in a collection (which likely requires bag semantics).
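To make the intended behavior concrete, here is a minimal Python sketch of what the two singleton accessors could do per row. It is not Fenl and not Kaskada's implementation. The helper names `list_get` and `map_get` are hypothetical, and returning null (`None`) for a missing index or key is an assumption; the issue does not specify that case.

```python
from typing import Mapping, Optional, Sequence, TypeVar

K = TypeVar("K")
V = TypeVar("V")
T = TypeVar("T")


def list_get(values: Optional[Sequence[T]], index: int) -> Optional[T]:
    """Return values[index], or None if the list is null or too short.

    Assumption: out-of-range access yields null rather than an error.
    """
    if values is None or not 0 <= index < len(values):
        return None
    return values[index]


def map_get(mapping: Optional[Mapping[K, V]], key: K) -> Optional[V]:
    """Return mapping[key], or None if the map is null or lacks the key."""
    if mapping is None:
        return None
    return mapping.get(key)


# Two hypothetical rows of `Table`, each carrying one list and one map.
rows = [
    {"list_of_foos": [1, 2, 3], "map_of_foos": {"bar": 10}},
    {"list_of_foos": [], "map_of_foos": {"baz": 20}},
]

# Each accessor yields exactly one value per input row.
first_foo = [list_get(row["list_of_foos"], 0) for row in rows]
bar = [map_get(row["map_of_foos"], "bar") for row in rows]
```

Because each accessor produces one value per row, the output has the same shape as the input, which is why this step does not need bag semantics.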

kerinin avatar May 24 '23 04:05 kerinin

We've also talked about supporting unnest and possibly nest, similar to BigQuery (and others). This would require some of the "bag semantics" work to support multiple simultaneous values.

bjchambers avatar May 24 '23 16:05 bjchambers

In the shortest term, we could even support collections but leave them as "opaque" columns. Basically -- read them in, and if they are plumbed through to the output, write them out. Even without being able to index on them (or other singleton methods, such as contains), this would be useful since it would allow working with data sets that contain collections without dropping the column or otherwise failing.

bjchambers avatar May 24 '23 16:05 bjchambers

This also seems like a partial duplicate of #367. These should possibly be merged.

bjchambers avatar May 24 '23 16:05 bjchambers

I think (as #367 identified) this would also require some support for generic types. Specifically, I think we would need something like the following. The specific methods / behaviors are TBD, but the key point is that we use generic types to describe how the result relates to the arguments.

get(collection: Keyed<K, V>, key: K) -> V

# And lists of `T` are `Keyed<u64, T>` and maps are also keyed.
# Then `collection[key]` compiles to `get(collection, key)`.

contains(collection: Container<T>, item: T) -> bool

# And lists of `T` are `Container<T>` and `Maps<K, V>` are `Container<K>`.
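As a rough illustration of how those two abstractions relate to lists and maps, here is a Python sketch using `typing.Generic`. The class names `ListValue` and `MapValue` and the method names `lookup` and `has` are hypothetical, Python's `int` stands in for `u64`, and returning `None` for a missing key is an assumption.

```python
from typing import Dict, Generic, List, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")
T = TypeVar("T")


class Keyed(Generic[K, V]):
    """Anything that maps a key of type K to a value of type V."""

    def lookup(self, key: K) -> Optional[V]:
        raise NotImplementedError


class Container(Generic[T]):
    """Anything that can answer whether it holds an item of type T."""

    def has(self, item: T) -> bool:
        raise NotImplementedError


class ListValue(Keyed[int, T], Container[T]):
    """A list of T is keyed by position and contains its elements."""

    def __init__(self, items: List[T]) -> None:
        self.items = items

    def lookup(self, key: int) -> Optional[T]:
        return self.items[key] if 0 <= key < len(self.items) else None

    def has(self, item: T) -> bool:
        return item in self.items


class MapValue(Keyed[K, V], Container[K]):
    """A map is keyed by K and, as a container, holds its keys."""

    def __init__(self, entries: Dict[K, V]) -> None:
        self.entries = entries

    def lookup(self, key: K) -> Optional[V]:
        return self.entries.get(key)

    def has(self, item: K) -> bool:
        return item in self.entries


def get(collection: Keyed[K, V], key: K) -> Optional[V]:
    """What `collection[key]` would compile to in this sketch."""
    return collection.lookup(key)


def contains(collection: Container[T], item: T) -> bool:
    return collection.has(item)


foos = ListValue(["a", "b"])
prices = MapValue({"bar": 6})
```

With this shape, `get` and `contains` are each written once, and the result type follows from the type arguments of whichever collection is passed in.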

bjchambers avatar May 24 '23 16:05 bjchambers

Copying comment from #367

Representing maps in Fenl will likely depend on bag semantics. BigQuery, for example, unnests (https://cloud.google.com/bigquery/docs/arrays) arrays to expand the single row to multiple rows. However, since we have subsorts, we cannot do that now.

Unnest a map:

T1, Map{foo: 5, bar: 6, baz: 7} ->
  T1, foo, 5
  T1, bar, 6
  T1, baz, 7
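For reference, the expansion above can be sketched in a few lines of Python. This only illustrates the row-multiplying behavior and is not a proposal for how Fenl would implement it; the function name `unnest_map` and the tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def unnest_map(time: str, mapping: Dict[K, V]) -> List[Tuple[str, K, V]]:
    """Expand one (time, map) row into one (time, key, value) row per entry."""
    return [(time, key, value) for key, value in mapping.items()]


rows = unnest_map("T1", {"foo": 5, "bar": 6, "baz": 7})
# Every output row carries the same time, T1.
```

All three output rows share the time `T1`, so a single input row has become several simultaneous values, which is the case that needs bag semantics.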

However, we could still do singular functions that treat the map as a single value. We could support:

contains(map: map<K, V>, key: K) -> bool
get(map: map<K, V>, key: K) -> V

We cannot easily support plural functions, like "aggregate each key's value, produce a list of results".

Singular functions would still require a non-trivial amount of work in type-inference/compilation, due to the generic handling required.
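To give a sense of what the generic handling involves, here is a small Python sketch of inferring the result type of `get(map<K, V>, K) -> V` by unifying the signature against the argument types. It is a toy under stated assumptions, not Kaskada's type checker: types are encoded as strings or tuples, `K` and `V` are the only type variables, and every name (`unify`, `substitute`, `infer_result`, `GET_SIGNATURE`) is hypothetical.

```python
from typing import Dict, List, Tuple, Union

# A type is a name ("string", "i64", or a type variable such as "K")
# or a tuple like ("map", key_type, value_type).
Type = Union[str, tuple]

TYPE_VARS = {"K", "V"}


def unify(pattern: Type, actual: Type, bindings: Dict[str, Type]) -> None:
    """Match `actual` against `pattern`, recording type-variable bindings."""
    if isinstance(pattern, str):
        if pattern in TYPE_VARS:
            bound = bindings.setdefault(pattern, actual)
            if bound != actual:
                raise TypeError(f"{pattern} is {bound}, but got {actual}")
        elif pattern != actual:
            raise TypeError(f"expected {pattern}, got {actual}")
        return
    if not isinstance(actual, tuple) or len(pattern) != len(actual):
        raise TypeError(f"expected {pattern}, got {actual}")
    for sub_pattern, sub_actual in zip(pattern, actual):
        unify(sub_pattern, sub_actual, bindings)


def substitute(pattern: Type, bindings: Dict[str, Type]) -> Type:
    """Replace bound type variables in `pattern` with their types."""
    if isinstance(pattern, str):
        return bindings.get(pattern, pattern)
    return tuple(substitute(part, bindings) for part in pattern)


# get(map: map<K, V>, key: K) -> V
GET_SIGNATURE: Tuple[List[Type], Type] = ([("map", "K", "V"), "K"], "V")


def infer_result(signature: Tuple[List[Type], Type], args: List[Type]) -> Type:
    """Infer the call's result type from the argument types."""
    params, result = signature
    if len(params) != len(args):
        raise TypeError("wrong number of arguments")
    bindings: Dict[str, Type] = {}
    for param, arg in zip(params, args):
        unify(param, arg, bindings)
    return substitute(result, bindings)


result = infer_result(GET_SIGNATURE, [("map", "string", "i64"), "string"])
```

Passing an `i64` key to a `map<string, i64>` raises a `TypeError` here, because `K` is already bound to `string` by the first argument.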

jordanrfrazier avatar Jun 06 '23 22:06 jordanrfrazier