soda-core
soda-core copied to clipboard
Contract identity issue
We found a problem with the lib api. When we have multiple checks we have an error that is considered an execution error. You can find the error in the source. We thought that maybe the error was being added to the logs, but the code after the log fixed the issues, so the logs shouldn't exist. For now the solution will be to add the field identity_suffix into the check. But I think this should be fixed on SODA side, or using more field to create the ID or not log the error and use the approach with index.
logs.error(f"Duplicate check identity '{check_identity}': {other_check.location} and {check_location}")
Another question: Multiple checks need the field metric to be defined; what is the use for that field? Will that field always be mandatory?
SAS-3274
The problem with the current identity generation is that dataset level checks often will need an identity_suffix. We need to improve the default behavior so that in less circumstances the user needs to fiddle with the identity.
Goal: The goal is create a better correlation mechanism between checks in the files and checks in Soda Cloud. We want to impose the least amount of burden on the user and avoid the need for users to configure correlation ids themselves. But we also want to ensure that users can edit contract files freely and move checks around while still preserving the identity. Not preserving the identity causes a new check on Soda Cloud being created, loosing the metric history and historic check results.
Proposed solution 1:
The check identity is composed of
- scope
-
warehouse_name
-
schema_name
-
dataset_name
-
column_name
- check type
-
- check type specific correlation properties
-
expression_sql
hash formetric_expression
checks -
query_sql
hash formetric_query
checks
-
- user defined
correlation_id
Users then only have to specify a correlation_id
if the scope and check type specific correlation properties do not distinctly identify a check in the source YAML file.
Proposed solution 2:
Check identity is a composition of
-
warehouse_name
-
schema_name
-
dataset_name
-
column_name
- check type
- check name
We can use the check name to provide uniqueness. But in that case, users have to know that changing the name potentially will change the check id and hence break history in Soda Cloud.
In the hellofresh use case, the name is handled as a unique ID, and the user must fill it. I think the solution 2 can be the best one. We can write our docs explaining that the name is also linked to the soda cloud, and by changing the name, the history will be lost. But I think the users expect that because it is a unique ID for us.
More analysis notes:
Solution 2 would also work best on the Soda Cloud backend.
Based on the correlation properties, contracts lib should create an identity
property in the genrated SodaCL using a hash (and not the full property serialized text)
We can offer a renaming check workflow through an name_deprecated
property. That should translate to an identity_deprecated
in SodaCL. Soda Cloud backend work should be planned to support the identity migration. The changes needed in Soda Cloud backend should not be too big as the data model is matching this strategy.