malloy
malloy copied to clipboard
Write column lineage to query metadata
From a conversation today with @lloydtabb --
For any given field in a query, Malloy understands what were all the inputs to that field, and thus we can fully understand the lineage of every column. We should write this lineage to query metadata.
For example, if we have a source:
source: order_items is order_items_table + {
primary_key: id
join_one:
users with user_id
measure:
revenue_per_user is sale_price.sum() / count(distinct users.id)
and a query:
query: order_items -> {
group_by: created_month is created_at.month
aggregate: revenue_per_user
}
there should be metadata associated with each column in the resultset that describes the lineage of that column.
In this example, created_month
comes from order_items.created_at
, and revenue_per_user
comes from order_items.sale_price
and order_items.users.id
. This is a simple example, but if one or more of those fields was another complex field that had multiple inputs, then this lineage forms a tree structure.
A few examples of tools that support this:
I'm not sure @carlineng, but it seems like many databases (e.g., DuckDB, Snowflake) have a COMMENT ON
SQL statement that would allow this column lineage information to be added to the metadata for the column, too.