data-infra icon indicating copy to clipboard operation
data-infra copied to clipboard

Deprecate obsolete `staging` tables/views

Open lauriemerrell opened this issue 2 years ago • 1 comments

As an analytics engineer, I want to deprecate outdated & unused tables and views in our data warehouse to reduce the risk of users accidentally using incorrect or obsolete data and to reduce noise and clutter in audit activities (for example, auditing for models lacking documentation.)

AC:

  • [ ] Investigate all models in the staging dataset that aren't being updated regularly (see query below)
    • [ ] Check whether they appear in the dbt project (some of these are incremental models)
    • [ ] Check whether there is documentation of their intended purpose
  • [ ] Deprecate unused staging models
-- query to identify outdated models
SELECT
  table_name, table_type, creation_time
FROM
  `cal-itp-data-infra`.`staging`.INFORMATION_SCHEMA.TABLES
WHERE
  creation_time < "<yesterday's date>"
ORDER BY creation_time

lauriemerrell avatar Nov 13 '23 17:11 lauriemerrell

Size is "large" because deprecation process spans multiple weeks

lauriemerrell avatar Nov 13 '23 17:11 lauriemerrell

I am using a different query to list tables from other datasets:

SELECT  table_schema, table_name, table_type, creation_time
FROM `cal-itp-data-infra`.`region-us-west2`.INFORMATION_SCHEMA.TABLES
WHERE creation_time < "2025-01-01" and table_schema not in ('audit', 'blackcat_raw')

The problem is selecting only based on creation_time does not give only "deprecated" tables. We need to find an easier way to compare our codebase with the existing models on BigQuery.

erikamov avatar Mar 12 '25 20:03 erikamov

One option might be loading information from the dbt manifest/run results into a temporary table and using that to identify models that still exist in the codebase

lauriemerrell avatar Mar 13 '25 02:03 lauriemerrell

Thanks. I will check the manifest.

erikamov avatar Mar 13 '25 17:03 erikamov

I created a table on staging cal-itp-data-infra-staging.erikap.dbt_models with the list of models inside the codebase, based on @lauriemerrell's idea. :D

I run the query bellow and it is currently returning 279 models (122 are staging) that can be potentially dropped.

The datasets 'audit', 'blackcat_raw', and 'uploaded_data' were removed because are not inside data-infra/warehouse/models, but will be reviewed.

-- query to identify models not inside codebase
SELECT table_schema, table_name, table_type, creation_time
FROM `cal-itp-data-infra`.`region-us-west2`.INFORMATION_SCHEMA.TABLES AS tables
WHERE table_schema not in ('audit', 'blackcat_raw', 'uploaded_data')
  AND NOT EXISTS (SELECT 1
                    FROM `cal-itp-data-infra-staging`.`erikap`.dbt_models AS models
                   WHERE models.table_schema = tables.table_schema
                     AND models.table_name = tables.table_name)
ORDER BY table_schema, table_name

erikamov avatar Mar 13 '25 23:03 erikamov

This is a list of 49 obsolete tables created on staging dataset that belongs to a different dataset.

Image

obsolete staging tables on wrong schema.csv

Query:

SELECT tables.table_schema, tables.table_name, table_type, creation_time, models.table_schema AS correct_table_schema
  FROM `cal-itp-data-infra`.staging.INFORMATION_SCHEMA.TABLES AS tables
 INNER JOIN `cal-itp-data-infra-staging`.`erikap`.dbt_models AS models 
    ON models.table_name = tables.table_name 
   AND models.table_schema != tables.table_schema
 ORDER BY tables.table_name

erikamov avatar Mar 14 '25 18:03 erikamov

I generated another list with 73 tables/views created on staging dataset that are not in our codebase.

@csuyat-dot and @KatrinaMKaiser, some models were for NTD validation (blackcat), but not sure if we need to keep or not: stg_ntd_2022_rr20_financial, stg_ntd_2022_rr20_service, stg_ntd_2023_a10, stg_ntd_2023_a30_assetandresourceinfo, stg_ntd_2023_rr20_rural, stg_ntd_2023_rr20_urban_tribal, int_ntd_2023_rr20_service_alldata, int_ntd_2023_rr20_service_ratios, int_ntd_rr20_service_alldata

Do you all see any other models that we would need to keep? @evansiroky @vevetron @charlie-costanzo

Image

obsolete tables on staging and not in our codebase.csv

Query:

SELECT table_schema, table_name, table_type, creation_time
FROM `cal-itp-data-infra`.staging.INFORMATION_SCHEMA.TABLES AS tables
WHERE NOT EXISTS (SELECT 1
                    FROM `cal-itp-data-infra-staging`.`erikap`.dbt_models AS models 
                    WHERE models.table_name = tables.table_name)
ORDER BY table_name

erikamov avatar Mar 14 '25 18:03 erikamov

I think in general, stuff on staging isn't needed except for testing. I'm not doing any active development these days, so I'm not using anything.

evansiroky avatar Mar 14 '25 18:03 evansiroky

All 49 obsolete models created on staging dataset that belongs to a different dataset were removed today.

erikamov avatar Mar 18 '25 17:03 erikamov

Also dropped these obsolete models (from the 73 list):

staging.int_device_transaction_pairs_common_fields
staging.int_gtfs_quality__feed_aggregator_rt
staging.int_gtfs_quality__feed_aggregator_schedule
staging.int_gtfs_quality__feed_listed_sa
staging.int_gtfs_quality__feed_listed_schedule
staging.int_gtfs_quality__feed_listed_tu
staging.int_gtfs_quality__feed_listed_vp
staging.int_gtfs_quality__feed_present_service_alerts staging.int_gtfs_quality__feed_present_trip_updates staging.int_gtfs_quality__feed_present_vehicle_positions
staging.int_gtfs_quality__guideline_checks_long_new_index staging.int_gtfs_quality__no_rt_critical_validation_errors
staging.int_gtfs_quality__trip_planner_rt staging.int_gtfs_quality__trip_planner_schedule staging.int_gtfs_rt__service_alerts_trip_summaries
staging.int_gtfs_rt__trip_updates_no_stop_times staging.int_gtfs_rt__trip_updates_summaries staging.int_gtfs_rt__vehicle_positions_trip_summaries staging.int_gtfs_schedule__deduped_attributions staging.int_gtfs_schedule__deduped_fare_leg_rules staging.int_gtfs_schedule__deduped_fare_products
staging.int_gtfs_schedule__deduped_fare_rules staging.int_gtfs_schedule__deduped_fare_transfer_rules
staging.int_gtfs_schedule__deduped_feed_info
staging.int_gtfs_schedule__deduped_stops
staging.int_gtfs_schedule__deduped_transfers
staging.int_gtfs_schedule__deduped_trips
staging.int_gtfs_schedule__incremental_shapes staging.int_gtfs_schedule__incremental_stop_times staging.int_gtfs_schedule__incremental_stops
staging.int_gtfs_schedule__incremental_trips
staging.int_littlepay__cleaned_micropayment_device_transactions staging.int_littlepay__customer_funding_source_vaults staging.int_littlepay__customers
staging.int_littlepay__device_transaction_types staging.int_transit_database__gtfs_datasets_history staging.int_transit_database__gtfs_service_data_history staging.int_transit_database__organizations_history staging.int_transit_database__service_datasets_pivoted
staging.int_transit_database__services_history
staging.payments_gtfs_datasets
staging.stg_gtfs_guidelines__complete_wheelchair_accessibility_data staging.stg_gtfs_guidelines__fact_daily_transitland_url_check staging.stg_gtfs_guidelines__feed_guideline_index staging.stg_gtfs_guidelines__include_tts
staging.stg_gtfs_guidelines__no_expired_services
staging.stg_gtfs_guidelines__no_rt_critical_validation_errors staging.stg_gtfs_guidelines__no_validation_errors staging.stg_gtfs_guidelines__no_validation_errors_in_last_30_days staging.stg_gtfs_guidelines__pathways_valid staging.stg_gtfs_guidelines__schedule_downloaded_successfully staging.stg_gtfs_guidelines__schedule_feed_on_transitland staging.stg_gtfs_guidelines__service_alerts_feed_on_transitland staging.stg_gtfs_guidelines__service_alerts_feed_present
staging.stg_gtfs_guidelines__shapes_file_present
staging.stg_gtfs_guidelines__shapes_for_all_trips staging.stg_gtfs_guidelines__shapes_valid staging.stg_gtfs_guidelines__technical_contact_listed staging.stg_gtfs_guidelines__trip_id_alignment
staging.stg_gtfs_guidelines__trip_updates_feed_on_transitland staging.stg_gtfs_guidelines__trip_updates_feed_present
staging.stg_gtfs_guidelines__vehicle_positions_feed_on_transitland
staging.stg_gtfs_guidelines__vehicle_positions_feed_present

I left out these models for now to give more time for input:

staging.ntd_agency_to_organization
staging.int_ntd_2023_rr20_service_alldata staging.int_ntd_2023_rr20_service_ratios
staging.int_ntd_rr20_service_alldata
staging.stg_ntd_2022_rr20_financial staging.stg_ntd_2022_rr20_service staging.stg_ntd_2023_a10
staging.stg_ntd_2023_a30_assetandresourceinfo staging.stg_ntd_2023_rr20_rural staging.stg_ntd_2023_rr20_urban_tribal

erikamov avatar Mar 18 '25 18:03 erikamov

Remaining obsolete Staging tables were delete.

Image
SELECT table_schema, table_name, table_type, creation_time
FROM `cal-itp-data-infra`.staging.INFORMATION_SCHEMA.TABLES AS tables
WHERE NOT EXISTS (SELECT 1
                    FROM `cal-itp-data-infra-staging`.`erikap`.dbt_models AS models 
                    WHERE models.table_name = tables.table_name)
ORDER BY table_name

erikamov avatar Mar 27 '25 22:03 erikamov

Are we able to also remove obsolete staging tables from other users using the same query?

evansiroky avatar Apr 07 '25 17:04 evansiroky