dagster
dagster copied to clipboard
whitelist the asset tags we store in the asset tags table (to query by)
Summary & Motivation
We have been storing a sizeable number of tags on asset events to track all sorts of things, including:
- multidimensional partitions
- data version
- code version
- input lineage
Most of these tags are only used by reading them out of the materialization event body.
All of them are currently getting stored within store_event
to the asset_event_tags
table. This is unnecessary for most of these tags.
The asset_event_tags
table is queried in two places:
- For a multi-dimensional partitioned asset, find all dimensions that have ever been materialized (
get_event_tags_for_asset
) - For a given asset, find the last data version by partition (
get_latest_tags_by_partition
).
This PR changes the store_event logic so that it only persists asset tag rows that match one of these two cases. This should save about 75% of the rows that we are currently writing for storing materializations/observations.
It's a little awkward, since we've provided a very generic API to service a very narrow set of queries, and we have to provide backwards compatibility for that API. This change should be pretty safe, though, since it is a private API (note that the signature has not changed at all).
Adding narrower APIs (e.g. get_latest_data_version_by_partition
, get_materialized_partition_dimensions
) was considered, but ultimately moved to be in a future diff, since we may need to workshop the exact names / signature.
Background context of this tag data in Cloud: https://www.notion.so/dagster/Migrating-asset-event-tags-in-Cloud-83508e282b2f4b508cbe78d5348414eb
How I Tested These Changes
BK
Is the proposal that if we need to query that information without getting a full record we'll add specialized indexes for them?
Is the proposal that if we need to query that information without getting a full record we'll add specialized indexes for them?
Specifically:
For data version, I think that data should be available in an asset partitions table. We may not need to store in asset_event_tags
table, but this PR does not change any reads.
For multidimensional partitions, this could potentially be a dedicated table, but I could see a world in which the current asset_event_tags
table ONLY contains multi-dimensional partition key/values in it.
Generally:
We should think about whether we need to index materializations by a certain tag or not before deciding whether to store in this table. We are not generally disciplined about this with run tags either, which is why things like dagster/image
and dagster-k8s/config
get added as run tags even though I doubt it's used to search for runs very often (there's just not a good other place to stash key/value information). If we had a distinction of system metadata and user metadata, we could shift a lot of things to system metadata.
Is the proposal that if we need to query that information without getting a full record we'll add specialized indexes for them?
Yes
For data version, I think that data should be available in an asset partitions table.
Yes please!
We should think about whether we need to index materializations by a certain tag or not before deciding whether to store in this table. We are not generally disciplined about this with run tags either, which is why things like dagster/image and dagster-k8s/config get added as run tags even though I doubt it's used to search for runs very often (there's just not a good other place to stash key/value information).
100%. Also dagster-k8s/config
should not be a tag at all, but metadata.