telemetry-airflow

Specify exact container version for pipelines feeding public data

Open · jklukas opened this issue 4 years ago · 1 comment

There's interest in developing stronger safeguards around pipelines that feed publicly released data. Pipeline code generally does not live in this repository, but this repo is a reasonable place for gatekeeping, since code generally has to be scheduled to run here.

To make sure we think through the implications of changes to published data, we should consider the following steps:

  • Isolate all pipelines that feed public data to a single DAG (public_data) that includes commentary about ensuring data review, etc.
  • Specify exact container versions to run within this DAG to lessen the chance of picking up code from another repository that has accidentally skipped data review
  • Protect changes to this DAG via a CODEOWNERS entry
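
As a rough illustration of the second step, here is a minimal sketch of a check that could run in CI against the images referenced by the proposed DAG. It assumes "exact container version" means an image reference pinned by sha256 digest, with an optional looser mode that accepts explicit non-floating tags. The mapping `PUBLIC_DATA_IMAGES`, the registry path, and the helper names are all hypothetical and not taken from this repository.

```python
import re

# Hypothetical task -> image mapping for the proposed public_data DAG.
# The registry path and task name are illustrative assumptions only.
PUBLIC_DATA_IMAGES = {
    "example_public_job": "gcr.io/example-project/example-etl@sha256:" + "a" * 64,
}

# A digest reference ends in "@sha256:" followed by 64 lowercase hex characters.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")

# Tags that can silently move to new code; this list is an assumption.
_FLOATING_TAGS = {"latest", "master", "main", "stable"}


def is_pinned(image: str, allow_tags: bool = False) -> bool:
    """Return True if the image reference points at one exact version."""
    if _DIGEST.search(image):
        return True
    if not allow_tags:
        return False
    # Look for a tag only in the last path component, so a registry port
    # such as "localhost:5000/img" is not mistaken for a tag.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name:
        return False
    tag = name.rsplit(":", 1)[1]
    return bool(tag) and tag not in _FLOATING_TAGS


def unpinned_images(images: dict) -> list:
    """Return the sorted task names whose image is not pinned by digest."""
    return sorted(task for task, image in images.items() if not is_pinned(image))


if __name__ == "__main__":
    offenders = unpinned_images(PUBLIC_DATA_IMAGES)
    if offenders:
        raise SystemExit("Unpinned images for tasks: " + ", ".join(offenders))
    print("All public_data images are pinned by digest.")
```

Pinning by digest is stricter than pinning by tag, since a tag can be re-pushed to point at different code. Whether that extra strictness is worth the maintenance cost is part of the open question here.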

cc @mreid-moz @scholtzan @fbertsch

jklukas · Apr 27 '20 15:04

Do we need to add this much process around this?

  • Anyone with re:dash access can, right now, create a query and immediately make it public
  • Anyone with BQ access can create a query and post the results locally

I'm trying to figure out what threat model we're working to address. If the goal is to prevent data from being made public by accident, perhaps we can manually review the available public queries weekly.

fbertsch · Apr 28 '20 15:04