dbt-snowflake-queries icon indicating copy to clipboard operation
dbt-snowflake-queries copied to clipboard

Handling views vs. tables

Open ian-whitestone opened this issue 3 years ago • 2 comments

First off, thanks for putting this together! Great stuff.

In stg_snowflake_query_tables.sql you are getting all the tables accessed by a query by looking at the direct_objects_accessed column in snowflake.account_usage.access_history. I was reading through these docs to better understand the difference between direct_objects_accessed and base_objects_accessed.

Let's say you have the following query:

SELECT * FROM my_view

And my_view is defined as CREATE VIEW my_view AS ( SELECT * FROM my_table WHERE foo = 'bar'). Based on this example in the docs, it sounds like:

  • my_view will appear in direct objects accessed
  • my_table will appear in base objects accessed

Note, probably worth validating this is actually the case. Let's assume it is for now.

As a result, querying stg_snowflake_query_tables could lead some users to think that my_table is not being used, since the only thing getting queried is my_view.

Could be worth throwing a caveat in the README.md (again assuming this is true). As a fix, you could look at modifying the code to check both base_objects and direct_objects, then do a union and distinct across all those (since in most cases, the same table will appear in both direct and base), or keep both around and add some boolean like is_directly_accessed and is_indirectly_accessed -- in base but not in direct.

ian-whitestone avatar Jul 20 '22 17:07 ian-whitestone

Thanks for raising this issue. I originally made the decision to use direct and ignore base because I wanted to align the notion of "object" here with dbt's notion of "model". In the context of dbt I generally don't distinguish between views and tables.

To get at indirect usage, I have a model (not in this repo, yet) derived from the dbt manifest called node_descendents (expanded from the child_map), and I left join onto that, and aggregate both 'direct usage' and 'downstream usage'.

Is there another case where knowing about underlying views (inside or outside of dbt) would be valuable? I've considered making the table more polymorphic- like the modified_objects could be rows (though there seems to always be exactly 1) with an action label: modified, accessed, indirectly_accessed.

jaysobel avatar Jul 20 '22 17:07 jaysobel

👍

Is there another case where knowing about underlying views (inside or outside of dbt) would be valuable?

my current use case is trying to find tables I can remove to save storage costs. for that, i need to know what underlying tables (the things that incur the storage costs) are being used, either directly or indirectly through a view.

ian-whitestone avatar Jul 20 '22 18:07 ian-whitestone