soda-core icon indicating copy to clipboard operation
soda-core copied to clipboard

Soda core 3.0.3+ introduced breaking change for count tests

Open edreispi opened this issue 3 years ago • 7 comments

Hello,

We have been using soda-core-snowflake 3.0.1 internally for quality checks and recently decided to test using 3.0.7. We found the following breaking change when trying to make a test using a column that is numeric in Snowflake:

When doing a missing_count > 0:

Soda 3.0.1:

SELECT 
  COUNT(*),
  COUNT(CASE WHEN NUMERIC_COLUMN IS NULL THEN 1 END),
  COUNT(CASE WHEN NUMERIC_COLUMN IS NULL THEN 1 END),
  COUNT(CASE WHEN NOT (NUMERIC_COLUMN IS NULL) AND NOT (REGEXP_LIKE(NUMERIC_COLUMN, '^ *[-+]? *[0-9]+ *$')) THEN 1 END) 
FROM schema.table

Soda 3.0.3 and up:

SELECT 
  COUNT(*),
  COUNT(CASE WHEN NUMERIC_COLUMN IS NULL THEN 1 END),
  COUNT(CASE WHEN NUMERIC_COLUMN IS NULL THEN 1 END),
  COUNT(CASE WHEN NOT (NUMERIC_COLUMN IS NULL) AND NOT (REGEXP_LIKE(COLLATE(NUMERIC_COLUMN, ''), '^ *[-+]? *[0-9]+ *$')) THEN 1 END) 
FROM schema.table

The collate breaks down in Snowflake without casting the numeric field to a string.

Release logs seem to point to @ScottAtDisney as the contributor to this change, not sure if @m1n0 has any context on this change as well.

If these tests were never meant for numeric columns it might be a good idea to update the documentation on this.

Thank you!

edreispi avatar Sep 13 '22 22:09 edreispi

SODA-1153

jmarien avatar Sep 13 '22 22:09 jmarien

Collation allows for different encoding and rules for string behaviors in Snowflake. Regular expressions do not support collation at this time so the patch is to remove the collation then attempt the regular expression.

If I understand your use case, you are attempting to use a regular expression on a numeric column, which is curious. If that is your intent, I'd recommend explicitly converting the numeric field to a varchar with TO_VARCHAR then Snowflake will be able to perform the regular expression on the the string.

scott-fought avatar Sep 14 '22 01:09 scott-fought

@ScottAtDisney I might not have been clear enough, the SQL queries shown are the output of Soda, not a custom check made by us. The regex is part of soda core. The collate might work in Redshift and Big Query but it creates this issue when using numeric columns in Snowflake. Adding the to_varchar in soda-core might be an easy fix for this, but I am left wondering why the change was made and if it was tested thoroughly with Snowflake.

edreispi avatar Sep 14 '22 19:09 edreispi

@ScottAtDisney I see you worked on https://github.com/sodadata/soda-core/pull/1516, which seems to have introduced this COLLATE behavior, which I believe is used by many of the checks.

edreispi avatar Sep 14 '22 19:09 edreispi

Are you using a metric argument like,

  • missing values
  • missing regex
  • missing format

Could you please share a minimal check file, the data type of the column and some sample data that creates this issue?

scott-fought avatar Sep 14 '22 19:09 scott-fought

In our particular case it fails with the invalid count check. Here is an example file that fails:

checks for EXAMPLE_TABLE:

  - row_count > 0:
      name: The table has more than 0 rows

  - duplicate_count(VARCHAR_COLUMN) = 0:
      name: The table has no duplicates

  - invalid_count(NUMERIC_COLUMN) = 0:
      valid format: integer
      name: The NUMERIC_COLUMN has no wrong values

Sample Data:

create table EXAMPLE_TABLE (NUMERIC_COLUMN NUMBER(3,0), VARCHAR_COLUMN varchar(6));

insert into EXAMPLE_TABLE values (123, 'abcdef'), (456, 'ghijk'), (789, 'lmnop');

Running with soda scan -d default -c configuration.yml EXAMPLE_TABLE.yml, using soda 3.0.7, I get the following output (slightly truncated):

Query execution error in default.EXAMPLE_TABLE.aggregation[0]: 001045 (22023): SQL compilation error:
argument needs to be a string: 'EXAMPLE_TABLE.NUMERIC_COLUMN'
SELECT 
  COUNT(*),
  COUNT(CASE WHEN NOT (NUMERIC_COLUMN IS NULL) AND NOT (REGEXP_LIKE(COLLATE(NUMERIC_COLUMN, ''), '^ *[-+]? *[0-9]+ *$')) THEN 1 END) 
FROM EXAMPLE_TABLE
  | 001045 (22023): SQL compilation error:
  | argument needs to be a string: 'EXAMPLE_TABLE.NUMERIC_COLUMN'
Metrics row_count were not computed for check row_count > 0
Metrics invalid_count were not computed for check invalid_count(NUMERIC_COLUMN) = 0
Scan summary:
1/3 checks PASSED: 
    EXAMPLE_TABLE in default
      The table has no duplicates [PASSED]
2/3 checks NOT EVALUATED: 
    EXAMPLE_TABLE in default
      The table has more than 0 rows [NOT EVALUATED]
        check_value: None
      The NUMERIC_COLUMN has no wrong values [NOT EVALUATED]
        check_value: None
2 checks not evaluated.
3 errors.
Oops! 3 errors. 0 failures. 0 warnings. 1 pass.
ERRORS:
Query execution error in default.EXAMPLE_TABLE.aggregation[0]: 001045 (22023): SQL compilation error:
argument needs to be a string: 'EXAMPLE_TABLE.NUMERIC_COLUMN'
SELECT 
  COUNT(*),
  COUNT(CASE WHEN NOT (NUMERIC_COLUMN IS NULL) AND NOT (REGEXP_LIKE(COLLATE(NUMERIC_COLUMN, ''), '^ *[-+]? *[0-9]+ *$')) THEN 1 END) 
FROM EXAMPLE_TABLE
  | 001045 (22023): SQL compilation error:
  | argument needs to be a string: 'EXAMPLE_TABLE.NUMERIC_COLUMN'
Metrics row_count were not computed for check row_count > 0
Metrics invalid_count were not computed for check invalid_count(NUMERIC_COLUMN) = 0

The second test works, the third one causes the issue, and the first one is not evaluated.

When running with Soda-core 3.0.2, I get the following output:

Scan summary:
3/3 checks PASSED: 
    EXAMPLE_TABLE in default
      The table has more than 0 rows [PASSED]
      The table has no duplicates [PASSED]
      The NUMERIC_COLUMN has no wrong values [PASSED]
All is good. No failures. No warnings. No errors.

Let me know if more example output would help or if you would like me to test in a different way.

Thanks!

edreispi avatar Sep 14 '22 20:09 edreispi

Got it. That's really helpful.

The Validity Metric doc states,

Valid formats apply only to columns using data type TEXT.

The NUMERIC_COLUMN is defined as NUMBER(3,0) so the (valid format: integer) metric would be out of scope as I understand the definition.

Since the data type of that column (NUMBER(3,0)) is enforced by the DB and the check is looking for an INTEGER, even if the check worked, it would always return passed.

A valid min/max metric could be used to check the range of integer values, but that column can only hold integers.

scott-fought avatar Sep 14 '22 20:09 scott-fought

Thank you for your help @ScottAtDisney. It is true that you cannot mix a check for a NUMBER field (which is enforced by the DB already) with one of Soda's validity checks, intended for a TEXT or VARCHAR field. Will close the issue.

edreispi avatar Oct 18 '22 17:10 edreispi