Deprecate obsolete `staging` tables/views
As an analytics engineer, I want to deprecate outdated and unused tables and views in our data warehouse, to reduce the risk of users accidentally using incorrect or obsolete data and to reduce noise and clutter in audit activities (for example, auditing for models lacking documentation).
AC:
- [ ] Investigate all models in the `staging` dataset that aren't being updated regularly (see query below)
- [ ] Check whether they appear in the dbt project (some of these are incremental models)
- [ ] Check whether there is documentation of their intended purpose
- [ ] Deprecate unused `staging` models
```sql
-- query to identify outdated models
SELECT
  table_name, table_type, creation_time
FROM
  `cal-itp-data-infra`.`staging`.INFORMATION_SCHEMA.TABLES
WHERE
  creation_time < "<yesterday's date>"
ORDER BY creation_time
```
Size is "large" because deprecation process spans multiple weeks
I am using a different query to list tables from other datasets:
```sql
SELECT table_schema, table_name, table_type, creation_time
FROM `cal-itp-data-infra`.`region-us-west2`.INFORMATION_SCHEMA.TABLES
WHERE creation_time < "2025-01-01" AND table_schema NOT IN ('audit', 'blackcat_raw')
```
The problem is that filtering only on creation_time does not isolate "deprecated" tables. We need an easier way to compare our codebase with the existing models on BigQuery.
One option might be loading information from the dbt manifest/run results into a temporary table and using that to identify models that still exist in the codebase
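A minimal sketch of that idea: parse dbt's `manifest.json` artifact and extract the `(table_schema, table_name)` pairs that each model materializes. The manifest below is a tiny in-memory stand-in (the project name, node keys, and schemas are illustrative, not the warehouse's real ones); in practice you would load `target/manifest.json` and write the pairs out for loading into a scratch table.

```python
# Stand-in for dbt's manifest.json (the real file lives under target/).
# Node keys, project name, and schemas here are illustrative only.
manifest = {
    "nodes": {
        "model.calitp_warehouse.stg_gtfs__stops": {
            "resource_type": "model",
            "schema": "staging",
            "name": "stg_gtfs__stops",
            "alias": None,
        },
        "model.calitp_warehouse.dim_stops": {
            "resource_type": "model",
            "schema": "mart_gtfs",
            "name": "dim_stops",
            "alias": None,
        },
        # Non-model nodes (tests, seeds, ...) should be skipped.
        "test.calitp_warehouse.not_null_dim_stops_key": {
            "resource_type": "test",
            "schema": "staging",
            "name": "not_null_dim_stops_key",
        },
    }
}

# Keep only models; dbt materializes each under (schema, alias or name).
rows = sorted(
    (n["schema"], n.get("alias") or n["name"])
    for n in manifest["nodes"].values()
    if n["resource_type"] == "model"
)
print(rows)
```

These pairs can then be loaded into a temporary BigQuery table and anti-joined against `INFORMATION_SCHEMA.TABLES`.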
Thanks. I will check the manifest.
I created a table in staging, cal-itp-data-infra-staging.erikap.dbt_models, with the list of models inside the codebase, based on @lauriemerrell's idea. :D
I ran the query below, and it currently returns 279 models (122 in staging) that can potentially be dropped.
The datasets 'audit', 'blackcat_raw', and 'uploaded_data' were excluded because they are not inside data-infra/warehouse/models, but they will be reviewed.
```sql
-- query to identify models not inside codebase
SELECT table_schema, table_name, table_type, creation_time
FROM `cal-itp-data-infra`.`region-us-west2`.INFORMATION_SCHEMA.TABLES AS tables
WHERE table_schema NOT IN ('audit', 'blackcat_raw', 'uploaded_data')
  AND NOT EXISTS (SELECT 1
                  FROM `cal-itp-data-infra-staging`.`erikap`.dbt_models AS models
                  WHERE models.table_schema = tables.table_schema
                    AND models.table_name = tables.table_name)
ORDER BY table_schema, table_name
```
This is a list of 49 obsolete tables created in the staging dataset that belong to a different dataset.
obsolete staging tables on wrong schema.csv
Query:
```sql
SELECT tables.table_schema, tables.table_name, table_type, creation_time,
       models.table_schema AS correct_table_schema
FROM `cal-itp-data-infra`.staging.INFORMATION_SCHEMA.TABLES AS tables
INNER JOIN `cal-itp-data-infra-staging`.`erikap`.dbt_models AS models
  ON models.table_name = tables.table_name
  AND models.table_schema != tables.table_schema
ORDER BY tables.table_name
```
I generated another list with 73 tables/views created in the staging dataset that are not in our codebase.
@csuyat-dot and @KatrinaMKaiser, some models were for NTD validation (blackcat), but I'm not sure whether we need to keep them:
stg_ntd_2022_rr20_financial, stg_ntd_2022_rr20_service, stg_ntd_2023_a10, stg_ntd_2023_a30_assetandresourceinfo, stg_ntd_2023_rr20_rural, stg_ntd_2023_rr20_urban_tribal, int_ntd_2023_rr20_service_alldata, int_ntd_2023_rr20_service_ratios, int_ntd_rr20_service_alldata
Do you all see any other models that we would need to keep? @evansiroky @vevetron @charlie-costanzo
obsolete tables on staging and not in our codebase.csv
Query:
```sql
SELECT table_schema, table_name, table_type, creation_time
FROM `cal-itp-data-infra`.staging.INFORMATION_SCHEMA.TABLES AS tables
WHERE NOT EXISTS (SELECT 1
                  FROM `cal-itp-data-infra-staging`.`erikap`.dbt_models AS models
                  WHERE models.table_name = tables.table_name)
ORDER BY table_name
```
I think in general, stuff on staging isn't needed except for testing. I'm not doing any active development these days, so I'm not using anything.
All 49 obsolete models created in the staging dataset that belong to a different dataset were removed today.
Also dropped these obsolete models (from the 73 list):
staging.int_device_transaction_pairs_common_fields
staging.int_gtfs_quality__feed_aggregator_rt
staging.int_gtfs_quality__feed_aggregator_schedule
staging.int_gtfs_quality__feed_listed_sa
staging.int_gtfs_quality__feed_listed_schedule
staging.int_gtfs_quality__feed_listed_tu
staging.int_gtfs_quality__feed_listed_vp
staging.int_gtfs_quality__feed_present_service_alerts
staging.int_gtfs_quality__feed_present_trip_updates
staging.int_gtfs_quality__feed_present_vehicle_positions
staging.int_gtfs_quality__guideline_checks_long_new_index
staging.int_gtfs_quality__no_rt_critical_validation_errors
staging.int_gtfs_quality__trip_planner_rt
staging.int_gtfs_quality__trip_planner_schedule
staging.int_gtfs_rt__service_alerts_trip_summaries
staging.int_gtfs_rt__trip_updates_no_stop_times
staging.int_gtfs_rt__trip_updates_summaries
staging.int_gtfs_rt__vehicle_positions_trip_summaries
staging.int_gtfs_schedule__deduped_attributions
staging.int_gtfs_schedule__deduped_fare_leg_rules
staging.int_gtfs_schedule__deduped_fare_products
staging.int_gtfs_schedule__deduped_fare_rules
staging.int_gtfs_schedule__deduped_fare_transfer_rules
staging.int_gtfs_schedule__deduped_feed_info
staging.int_gtfs_schedule__deduped_stops
staging.int_gtfs_schedule__deduped_transfers
staging.int_gtfs_schedule__deduped_trips
staging.int_gtfs_schedule__incremental_shapes
staging.int_gtfs_schedule__incremental_stop_times
staging.int_gtfs_schedule__incremental_stops
staging.int_gtfs_schedule__incremental_trips
staging.int_littlepay__cleaned_micropayment_device_transactions
staging.int_littlepay__customer_funding_source_vaults
staging.int_littlepay__customers
staging.int_littlepay__device_transaction_types
staging.int_transit_database__gtfs_datasets_history
staging.int_transit_database__gtfs_service_data_history
staging.int_transit_database__organizations_history
staging.int_transit_database__service_datasets_pivoted
staging.int_transit_database__services_history
staging.payments_gtfs_datasets
staging.stg_gtfs_guidelines__complete_wheelchair_accessibility_data
staging.stg_gtfs_guidelines__fact_daily_transitland_url_check
staging.stg_gtfs_guidelines__feed_guideline_index
staging.stg_gtfs_guidelines__include_tts
staging.stg_gtfs_guidelines__no_expired_services
staging.stg_gtfs_guidelines__no_rt_critical_validation_errors
staging.stg_gtfs_guidelines__no_validation_errors
staging.stg_gtfs_guidelines__no_validation_errors_in_last_30_days
staging.stg_gtfs_guidelines__pathways_valid
staging.stg_gtfs_guidelines__schedule_downloaded_successfully
staging.stg_gtfs_guidelines__schedule_feed_on_transitland
staging.stg_gtfs_guidelines__service_alerts_feed_on_transitland
staging.stg_gtfs_guidelines__service_alerts_feed_present
staging.stg_gtfs_guidelines__shapes_file_present
staging.stg_gtfs_guidelines__shapes_for_all_trips
staging.stg_gtfs_guidelines__shapes_valid
staging.stg_gtfs_guidelines__technical_contact_listed
staging.stg_gtfs_guidelines__trip_id_alignment
staging.stg_gtfs_guidelines__trip_updates_feed_on_transitland
staging.stg_gtfs_guidelines__trip_updates_feed_present
staging.stg_gtfs_guidelines__vehicle_positions_feed_on_transitland
staging.stg_gtfs_guidelines__vehicle_positions_feed_present
I left out these models for now to give more time for input:
staging.ntd_agency_to_organization
staging.int_ntd_2023_rr20_service_alldata
staging.int_ntd_2023_rr20_service_ratios
staging.int_ntd_rr20_service_alldata
staging.stg_ntd_2022_rr20_financial
staging.stg_ntd_2022_rr20_service
staging.stg_ntd_2023_a10
staging.stg_ntd_2023_a30_assetandresourceinfo
staging.stg_ntd_2023_rr20_rural
staging.stg_ntd_2023_rr20_urban_tribal
The remaining obsolete staging tables were deleted.
```sql
SELECT table_schema, table_name, table_type, creation_time
FROM `cal-itp-data-infra`.staging.INFORMATION_SCHEMA.TABLES AS tables
WHERE NOT EXISTS (SELECT 1
                  FROM `cal-itp-data-infra-staging`.`erikap`.dbt_models AS models
                  WHERE models.table_name = tables.table_name)
ORDER BY table_name
```
Are we able to also remove obsolete staging tables from other users using the same query?
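If we do extend the cleanup to other users' datasets, one option is to turn the anti-join output into DROP statements for review before anyone runs them. A hedged sketch (the project name and the sample rows below are illustrative, shaped like the `INFORMATION_SCHEMA.TABLES` columns the queries above return):

```python
# Turn anti-join results (table_schema, table_name, table_type) into
# BigQuery DROP statements for manual review before execution.
PROJECT = "cal-itp-data-infra"  # assumption: the production project

def drop_statement(schema: str, name: str, table_type: str) -> str:
    # INFORMATION_SCHEMA reports views as "VIEW"; everything else
    # here is treated as a table.
    kind = "VIEW" if table_type == "VIEW" else "TABLE"
    return f"DROP {kind} IF EXISTS `{PROJECT}`.`{schema}`.`{name}`;"

# Illustrative rows only; in practice these come from the query results.
obsolete = [
    ("staging", "int_littlepay__customers", "BASE TABLE"),
    ("staging", "payments_gtfs_datasets", "VIEW"),
]
for row in obsolete:
    print(drop_statement(*row))
```

Emitting the statements as text (rather than executing them directly) keeps a human in the loop, which seems safer given that some staging models turned out to still be wanted.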