elementary icon indicating copy to clipboard operation
elementary copied to clipboard

schema changes based on baseline

Open JSantora opened this issue 3 years ago • 4 comments

Currently elementary schema_change tests will fail, when it is updated to the previous state, the next run fails as well. I have a scenario where receiving inbound share from snowflake, on schema changes checks, if 3rd party lets say changes column name, the schema checks will fail and result in the monitor reflecting that. At that point we let the 3rd party know that there was unwanted change and they revert to the correct column name. When we re run schema checks it is still failing as column removed. Is there a way you could add a flag to the test "compare to dbt model schema" that will change the behaviour of the test to compare the schema to the model schema.yaml definition, could be a great feature!

JSantora avatar Sep 14 '22 04:09 JSantora

Thank you @JSantora, the use case totally makes sense and it sounds like a great test! This means implementing a new schema test, I would call it elementary.base_schema, becuase it does not test for any schema changes, but for changes against a base schema. Would you prefer the base to be from the YML file or from current state (with some "update base schema" option)? Are the columns defined in your YML today?

Maayan-s avatar Sep 14 '22 14:09 Maayan-s

Hi @Maayan-s, I think if there is a way to compare to base schema (this could be the schema of the first run) without actually having the schema in yaml file, that would be probably preferable as it would save us manually adding schemas in yaml files. Again not sure how possible that is. Currently we actually not having any schemas saved in yml. Thanks for looking into this.

JSantora avatar Sep 14 '22 22:09 JSantora

I believe this should be a new test, elementary.base_schema, as the behavior is different from our current schema changes.

I see two options for implementation of this new test:

  1. Base schema is defined in YML
  2. Base schema is the state on the test first execution

Option 1 - Base schema is defined in YML Flow: The user defines the schema in the YML, in the dbt standard format. We can also add a type property for columns. The test pulls the YML schema from the dbt graph object, and the current schema from the information schema. If there is a diff between the 2, the test will fail.

Pros:

  • dbt has a YML format for schema that we can support.
  • Easy for the user to manage the base schema, including version control.

Cons:

  • Configuration overhead if you don't document schema. This can be mitigated with a codegen macro.

Option 2 - Base schema is the state on the test first execution Flow: On the first execution, the test pulls the current schema from the information schema, and saves it to a table (per test or to an incremental base_schemas table). On each execution, current schema from the information schema is compared to the stored schema. If there is a diff between the 2, the test will fail. We would also need to implement a way to "refresh" the base schema (drop the table or delete the records in the incremental table).

Pros:

  • Less configuration, more automated flow.

Cons:

  • Hard for user to know what is the base schema in place.
  • Harder to update the base schema.

Maayan-s avatar Sep 18 '22 14:09 Maayan-s

I'm in favor of option 1 as it's less complex and more inline with what dbt users are accustomed to do already. Also the name base_schema is a bit confusing IMO. I think we should make sure it provides a clear contract on what this test validates. Maybe schema_changes_from_baseline or providing a new flag called compare_with_baseline to the existing schema_changes test

oravi avatar Sep 18 '22 15:09 oravi

This issue is stale because it has been open for too long with no activity. If you would like the issue to remain open, please remove the stale label or leave a comment.

github-actions[bot] avatar Nov 30 '22 02:11 github-actions[bot]