soda-core
soda-core copied to clipboard
Review generated check names
@mathissedestrooper or Benjamin should review the check names as generated by Soda Core.
By default, check names are the main check line.
For metric checks having nested threshold level specification:
Given the checks.yml configuration:
checks for bitcoin:
- min(usd):
warn: when < 20000
fail: when < 10000
expected check name:
min(usd) fails when < 10000
See also https://github.com/sodadata/soda-core/issues/1154
@janet-can if you're able to assist, i'll describe a bit more in detail what the goal of this issue is. The goal is to provide Benjamin & Mathisse with an overview of how Soda Core generates check names. I ll explain the simple logic on how the check names are created. Than that logic needs to be applied on all the different check types as explained in the docs to get a good overview so that Benjamin and Mathisse can review.
Here's the python logic:
def generate_soda_cloud_check_name(self) -> str:
if self.check_cfg.name:
return self.check_cfg.name
name = self.check_cfg.source_line
if self.check_cfg.source_configurations:
source_cfg = dict(self.check_cfg.source_configurations)
if source_cfg.get("warn"):
name += f" warn {source_cfg['warn']}"
if source_cfg.get("fail"):
name += f" fail {source_cfg['fail']}"
return name
So we always take the check line. And if there is a nested warn or fail specified, those are added as indicated.
Next step is to go over the docs and compile a list of example checks. Next is for someone to read this logic and apply it manually to give an idea about the corresponding check name.
Maybe it's primitive but i ll start a table where the check names can be entered. Then just edit the comment to add more examples.
@tombaeyens sounds great.
FYI: @janet-can is helping @benjamin-pirotte and me in creating the acceptance criteria for each check type, we should include the expected name in this exercise so that we can do a comparison with the currently generated names and make changes accordingly. Sounds good?
Here's an example of how the check names could be documented for review. Just a suggestion so if you know a better way feel free.
checks for PRODUCTS:
# row_count between 10 and 1000
- row_count between 10 and 1000
# invalid_count(unit_cost) = 0
- invalid_count(unit_cost) = 0
valid min: 0.01
valid max: 10000
# invalid_percent(movement_date) warn when > 1%
- invalid_percent(movement_date):
valid format: date us
warn: when > 1%
This is an interesting exercise! Adding @benjamin-pirotte to the convo. :)
What you've described above makes sense to me, that to generate a check name that can be displayed in Soda Cloud, we need a formula that grabs the right bits of information from the check and pieces them together in a way that will make sense in a readable, UI format. And the nested alert part makes sense, too, as it effectively communicates the threshold of the check, just in the context of an alert.
Comments/questions:
- Because we'll eventually get to a point where both Soda Cloud and Soda Core users are using the same YAML-based language to write their checks, transferring a plain, one-line check from SodaCL to "readable UI format" likely doesn't need much translation. I might argue that where column names form part of a check, we could do something to massage the check name a bit, if possible, so that
invalid_count(unit_cost) = 0becomes "invalid_count for unit_cost = 0". - Nested validity format might be a bit tricky to add to a check name. The check name indicates a threshold, but not what constitutes valid/invalid/missing. Perhaps it's not worth adding that to a check name because it could eventually become a super long string. Also, I just noticed that when you click on Monitor Info in the Monitor Results table, you can see the raw check YAML which would give you what you need to know for valid/missing format info. cool!
- Should we consider globally defined valid/missing formats when it comes to check names? I guess if we don't include that info in the name itself (too long!) then, no. Is there any way we could point to globals inside the little YAML window in the Monitor Info panel? Or just indicate that global definitions exist?
- What happens when there are several nested configurations? In such a case, would the check name capture all nested items? That could get to be a really long string. Plus the check name would really need to include the column names to be usable in the UI. Like, "schema fail when customer_id or customer_last_name missing or when forbidden column credit_card exists and warn when wrong column type time for date or character varying for name" And even then, whenever that check fails or warns, you wouldn't know what part of the check triggered the fail or warn.
checks for orders:
- schema:
fail:
when required column missing: [customer_id, customer_last_name]
when forbidden column exists: [credit_card]
warn:
when wrong column type:
date: time
name: character varying
- When a variable is included in the check, how does the check name handle that? Example:
- freshness using row_added_ts with DATA_END_TS < 1h
- Far from critical, but when you have a
for eachcheck that runs, say, row_count checks on 8 tables and half of them fail, you would see 4 simultaneous "row count > 0" failed, but without clicking into the Monitor Info for each individual line item, you wouldn't know which tables failed the check. For a for each check name, could/should we also capture the table name to make it a bit easier on UI users? - Is it important to include any table filter or timestamp information in a check name, if applicable?
- If the check itself uses quotes for column name, should the check name also display quotes?
Suggestion:
- Say... what if we made
namea mandatory property, requiring the user to name each check themselves in anticipation of sending it to the cloud? Then all these problems go away! haha!
Far from critical, but when you have a for each check that runs, say, row_count checks on 8 tables and half of them fail, you would see 4 simultaneous "row count > 0" failed, but without clicking into the Monitor Info for each individual line item, you wouldn't know which tables failed the check. For a for each check name, could/should we also capture the table name to make it a bit easier on UI users?
Nevermind! I see just now that this has just been released as a feature in cloud. Problem solved! Please ignore this point.