
GCP: Decide what to do with tables when they are removed from this repository

Open fbertsch opened this issue 6 years ago • 3 comments

BQ tables are auto-generated and updated as the schemas change. Once a schema is removed from this repository, its table is dropped. We shouldn't be dropping data when a schema is removed; instead, we should retain the historical data for however long the retention period is (cc @mreid-moz).

Option 1: We keep the table in the same location, allowing for the small possibility that a new schema will later be written to that location (we could add automatic checking for this case; it would be especially bad if the old and new schemas weren't compatible, see the rough sketch of such a check below).
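For what it's worth, that automatic check could look something like the sketch below, using the Python BigQuery client. This is only a strawman: the helper that parses the newly generated BQ schema into `SchemaField`s is assumed, the table ID is made up, and only top-level fields are compared.

```python
from google.cloud import bigquery

def new_schema_is_compatible(client, table_id, new_fields):
    """Rough compatibility check: every column on the existing table must
    still exist in the newly generated schema with the same type.

    `new_fields` is a list of bigquery.SchemaField parsed from the generated
    BQ schema file (parsing helper not shown). Only top-level fields are
    compared here; nested RECORD fields would need a recursive check.
    """
    existing = {f.name: f.field_type for f in client.get_table(table_id).schema}
    proposed = {f.name: f.field_type for f in new_fields}
    return all(proposed.get(name) == ftype for name, ftype in existing.items())
```

If the check fails, we could refuse to reuse the location and require a new versioned table instead of silently colliding with the old data.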

Option 2: We move the data to a historical location, so we know it is no longer being updated, no new data is flowing in, and a new ping can take over the original location; the data remains queryable for the duration of the retention period.

I'm leaning towards (2.), but the downside is that we either need to manually change queries to point to the new location, or repoint views there (and version the views for the new data).

fbertsch avatar Aug 15 '19 18:08 fbertsch

I'm also vaguely pro (2), as it would be nice to more generally have a concept of "deprecated" or "historical" data. There are many ping types we have collected over the years that we probably don't care about processing (e.g. appusage). In https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/334 we decided which pings we care about and don't care about for the purposes of schemas. I do like the idea of being able to make an active decision to no longer process a ping type by removing its schemas.

I don't like (1) because it means the state of production and the state of generated-schemas would not be precisely the same. In this case I would prefer that we never drop schemas (hitherto the standard practice). This goes back to developing a notion of "deprecated" data, which doesn't exist for ingestion currently.

whd avatar Aug 15 '19 19:08 whd

You make a good point about (1.) not matching the generated-schemas branch. A tentative plan for (2.) could be:

  1. Schema is removed from this repository
  2. Deploy notices the schema is now missing and (rough sketch below):
     a. Copy the data to a historical location (TBD)
     b. Update the view to point at the historic data (related: auto-deployed views; if view definitions are not auto-deployed, manual intervention could be required)
     c. Drop the prod table
     d. Deploy gcp-ingest with no JSON schema
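To make (a)-(c) concrete, here is a rough sketch of what the deploy step could do with the Python BigQuery client. The dataset names, the views dataset, and the way we enumerate docTypes that still have schemas are all placeholders, not existing tooling.

```python
from google.cloud import bigquery

def retire_dropped_tables(client, schema_doctypes, prod_dataset,
                          archive_dataset, views_dataset):
    """For every prod table whose docType no longer has a schema in
    generated-schemas: copy it to the archive dataset, repoint its view,
    then drop the prod table. All dataset names here are hypothetical."""
    for item in client.list_tables(prod_dataset):
        doctype = item.table_id
        if doctype in schema_doctypes:
            continue  # schema still present, nothing to do

        prod = f"{prod_dataset}.{doctype}"
        archived = f"{archive_dataset}.{doctype}"

        # (a) copy the data to the historical location
        client.copy_table(prod, archived).result()

        # (b) repoint the user-facing view at the historical copy so
        #     existing queries keep working without manual edits
        client.query(
            f"CREATE OR REPLACE VIEW `{views_dataset}.{doctype}` "
            f"AS SELECT * FROM `{archived}`"
        ).result()

        # (c) drop the prod table; (d) happens separately, by deploying
        #     gcp-ingest without the JSON schema
        client.delete_table(prod, not_found_ok=True)
```

The set of docTypes that still have schemas would come from the generated-schemas checkout; and as noted in (b), if views aren't auto-deployed that step would still need manual intervention.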

I do believe that makes https://github.com/mozilla/bigquery-etl/issues/291 a dependency here.

fbertsch avatar Aug 15 '19 20:08 fbertsch

I'm not convinced that we gain much by actually moving the table to a historical location. I'd like to see a way of marking a docType as deprecated, perhaps via metadata in the JSON schema file itself.

Once a schema is marked as deprecated, perhaps we'd want the generated-schemas branch to include the BQ schema, but not include the JSON schema, so that the docType is no longer valid in the pipeline, but we don't remove the BQ table.

Perhaps we should have a deprecatedOn date or toBeRemovedOn date such that the schema generation machinery could automatically drop the BQ schema, and thus cause the table to be deleted, after all data in the table has expired.
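As a strawman for how the generation machinery could consume such a date, something like the sketch below; the `mozPipelineMetadata.deprecatedOn` field name and the retention constant are made up purely for illustration.

```python
import json
from datetime import date, timedelta

RETENTION = timedelta(days=180)  # hypothetical retention period

def should_keep_bq_schema(schema_path):
    """Decide whether generated-schemas should still include the BQ schema
    (and thus keep the table) for this docType. `mozPipelineMetadata.deprecatedOn`
    is a hypothetical field name used only for illustration."""
    with open(schema_path) as f:
        schema = json.load(f)
    deprecated_on = schema.get("mozPipelineMetadata", {}).get("deprecatedOn")
    if deprecated_on is None:
        return True  # not deprecated: keep the BQ schema and the table
    # keep the BQ schema until everything ingested before deprecation has
    # expired, at which point dropping it (and the table) loses nothing
    return date.today() < date.fromisoformat(deprecated_on) + RETENTION
```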

This vaguely seems like the kind of metadata we would want to maintain in GCP's Data Catalog.

jklukas avatar Aug 20 '19 19:08 jklukas