dagster icon indicating copy to clipboard operation
dagster copied to clipboard

whitelist the asset tags we store in the asset tags table (to query by)

Open prha opened this issue 9 months ago • 4 comments

Summary & Motivation

We have been storing a sizeable number of tags on asset events to track all sorts of things, including:

  1. multidimensional partitions
  2. data version
  3. code version
  4. input lineage

Most of these tags are only used by reading them out of the materialization event body.

All of them are currently getting stored within store_event to the asset_event_tags table. This is unnecessary for most of these tags.

The asset_event_tags table is queried in two places:

  1. For a multi-dimensional partitioned asset, find all dimensions that have ever been materialized (get_event_tags_for_asset)
  2. For a given asset, find the last data version by partition (get_latest_tags_by_partition).

This PR changes the store_event logic so that it only persists asset tag rows that match one of these two cases. This should save about 75% of the rows that we are currently writing for storing materializations/observations.

It's a little awkward, since we've provided a very generic API to service a very narrow set of queries, and we have to provide backwards compatibility for that API. This change should be pretty safe, though, since it is a private API (note that the signature has not changed at all).

Adding narrower APIs (e.g. get_latest_data_version_by_partition, get_materialized_partition_dimensions) was considered, but ultimately moved to be in a future diff, since we may need to workshop the exact names / signature.

Background context of this tag data in Cloud: https://www.notion.so/dagster/Migrating-asset-event-tags-in-Cloud-83508e282b2f4b508cbe78d5348414eb

How I Tested These Changes

BK

prha avatar May 06 '24 23:05 prha

Is the proposal that if we need to query that information without getting a full record we'll add specialized indexes for them?

schrockn avatar May 07 '24 14:05 schrockn

Is the proposal that if we need to query that information without getting a full record we'll add specialized indexes for them?

Specifically: For data version, I think that data should be available in an asset partitions table. We may not need to store in asset_event_tags table, but this PR does not change any reads. For multidimensional partitions, this could potentially be a dedicated table, but I could see a world in which the current asset_event_tags table ONLY contains multi-dimensional partition key/values in it.

Generally: We should think about whether we need to index materializations by a certain tag or not before deciding whether to store in this table. We are not generally disciplined about this with run tags either, which is why things like dagster/image and dagster-k8s/config get added as run tags even though I doubt it's used to search for runs very often (there's just not a good other place to stash key/value information). If we had a distinction of system metadata and user metadata, we could shift a lot of things to system metadata.

prha avatar May 07 '24 15:05 prha

Is the proposal that if we need to query that information without getting a full record we'll add specialized indexes for them?

Yes

prha avatar May 07 '24 16:05 prha

For data version, I think that data should be available in an asset partitions table.

Yes please!

We should think about whether we need to index materializations by a certain tag or not before deciding whether to store in this table. We are not generally disciplined about this with run tags either, which is why things like dagster/image and dagster-k8s/config get added as run tags even though I doubt it's used to search for runs very often (there's just not a good other place to stash key/value information).

100%. Also dagster-k8s/config should not be a tag at all, but metadata.

schrockn avatar May 07 '24 20:05 schrockn