soda-core

Issue with failed rows for checks like duplicates

Open albinkjellin opened this issue 2 years ago • 7 comments

This is causing the sync with Soda Cloud to fail.

albinkjellin avatar May 23 '22 14:05 albinkjellin

SODA-559

jmarien avatar May 23 '22 14:05 jmarien

I was able to replicate this issue using the reference data check:

checks for retail_customers:
  - values in country_code must exist in ref_countries iso:
    name: Ensure valid country codes

Gives this output:

(soda-cl-v3.0.0b15) $soda scan -d aws_postgres_retail uploaderror.yml
Soda Core 3.0.0b15
Empty file upload detected, not sending Content-Length header
No fileId received in response: {'code': 'invalid_empty_upload', 'message': 'File uploads may not be empty'}
Soda cloud error: Could not upload sample failed_rows
  | 'fileId'
Scan summary:
1/1 check PASSED: 
    retail_customers in aws_postgres_retail
      values in country_code must exist in ref_countries iso [PASSED]
1 errors.
Oops! 1 error. 0 failures. 0 warnings. 1 pass.
ERRORS:
Soda cloud error: Could not upload sample failed_rows
  | 'fileId'
Sending results to Soda Cloud
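
The output above suggests the failed-rows sample is uploaded even though the check passed and the sample is empty, which Soda Cloud rejects with `invalid_empty_upload`. A minimal sketch of a guard, assuming a hypothetical `upload` callable that stands in for the real HTTP call (all names below are illustrative, not soda-core's actual API):

```python
from typing import Callable, Optional, Sequence


def upload_sample_if_any(
    rows: Sequence[tuple],
    upload: Callable[[Sequence[tuple]], dict],
) -> Optional[str]:
    """Upload failed rows and return the file id, or None if there is nothing to upload.

    `upload` is a hypothetical stand-in for the HTTP call; it is assumed to
    return the decoded JSON response as a dict.
    """
    if not rows:
        # A passing check has no failed rows, so skip the upload entirely
        # instead of sending an empty file that the server will reject.
        return None
    response = upload(rows)
    # Use .get so an error response without "fileId" does not raise KeyError.
    return response.get("fileId")


def fake_upload(rows: Sequence[tuple]) -> dict:
    # Illustrative fake that mimics the rejection seen in the log above.
    if not rows:
        return {"code": "invalid_empty_upload", "message": "File uploads may not be empty"}
    return {"fileId": "file-123"}
```

With this guard an empty sample never reaches the upload call, and a response without `fileId` yields `None` rather than an exception.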

albinkjellin avatar May 24 '22 09:05 albinkjellin

Another example:

checks for dim_gift:
   - row_count = 551573989
   - duplicate_count(transaction_id) = 0

Results in this:

soda scan -V -d dwh_2020 -c C:\Users\dasher\.soda\configuration.yml C:\Users\dasher\soda_bigquery\checks.yml
Soda Core 3.0.0b15
Reading configuration file "C:\Users\dasher\.soda\configuration.yml"
Reading SodaCL file "C:\Users\dasher\soda_bigquery\checks.yml"
Scan execution starts
C:\Users\dasher\.venv\lib\site-packages\google\cloud\bigquery\client.py:535: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
  warnings.warn(
Query dwh_2020.dim_gift.aggregation[0]:
SELECT
  COUNT(*)
FROM dim_gift
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Making request: POST https://oauth2.googleapis.com/token
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Query dwh_2020.dim_gift.transaction_id.duplicate_count:
WITH frequencies AS (
  SELECT transaction_id, COUNT(*) AS frequency
  FROM dim_gift
  WHERE transaction_id IS NOT NULL
  GROUP BY transaction_id)
SELECT *
FROM frequencies
WHERE frequency > 1;
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Empty file upload detected, not sending Content-Length header
No fileId received in response: {'code': 'invalid_empty_upload', 'message': 'File uploads may not be empty'}
Soda cloud error: Could not upload sample dim_gift_transaction_id_failed_rows
  | 'fileId'
  | Stacktrace:
  | Traceback (most recent call last):
  |   File "C:\Users\dasher\.venv\lib\site-packages\soda\soda_cloud\soda_cloud.py", line 113, in upload_sample
  |     file_id = self._upload_sample_http(scan_definition_name, file_path, temp_file, temp_file_size_in_bytes)
  |   File "C:\Users\dasher\.venv\lib\site-packages\soda\soda_cloud\soda_cloud.py", line 143, in _upload_sample_http
  |     return upload_response_json["fileId"]
  | KeyError: 'fileId'
Scan summary:
2/2 queries OK
  dwh_2020.dim_gift.aggregation[0] [OK] 0:00:01.630912
  dwh_2020.dim_gift.transaction_id.duplicate_count [OK] 0:00:01.610153
1/2 checks PASSED:
    dim_gift in dwh_2020
      duplicate_count(transaction_id) = 0 [PASSED]
        check_value: 0
        failed_rows_sample_ref: soda_cloud 2x(0/0)
1/2 checks FAILED:
    dim_gift in dwh_2020
      row_count = 551573989 [FAILED]
        check_value: 438952718
1 errors.
Oops! 1 error. 1 failures. 0 warnings. 1 pass.
ERRORS:
Soda cloud error: Could not upload sample dim_gift_transaction_id_failed_rows
  | 'fileId'
  | Stacktrace:
  | Traceback (most recent call last):
  |   File "C:\Users\dasher\.venv\lib\site-packages\soda\soda_cloud\soda_cloud.py", line 113, in upload_sample
  |     file_id = self._upload_sample_http(scan_definition_name, file_path, temp_file, temp_file_size_in_bytes)
  |   File "C:\Users\dasher\.venv\lib\site-packages\soda\soda_cloud\soda_cloud.py", line 143, in _upload_sample_http
  |     return upload_response_json["fileId"]
  | KeyError: 'fileId'
Sending results to Soda Cloud
Error while executing Soda Cloud command response code: 400
{
  "code": "invalid_request",
  "message": "Failed request validation on the following properties:\nchecks[1].diagnostics.failedRowsFile.reference: may not be null\nchecks[1].diagnostics.failedRowsFile.reference: may not be null"
}
Open Telemetry: Skipping non-soda span 'BigQuery.job.begin'.
Open Telemetry: Skipping non-soda span 'BigQuery.getQueryResults'.
Open Telemetry: Skipping non-soda span 'BigQuery.job.begin'.
Open Telemetry: Skipping non-soda span 'BigQuery.getQueryResults'.
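
To illustrate why the sample is empty in this run: the duplicate query only returns values that occur more than once, so a passing `duplicate_count(transaction_id) = 0` check produces no failed rows. A small sketch using Python's built-in `sqlite3` with made-up data (the original scan ran against BigQuery; the table contents and the `find_duplicates` helper are assumptions for illustration):

```python
import sqlite3

# Same shape as the duplicate_count query in the log above.
DUPLICATES_QUERY = """
WITH frequencies AS (
  SELECT transaction_id, COUNT(*) AS frequency
  FROM dim_gift
  WHERE transaction_id IS NOT NULL
  GROUP BY transaction_id)
SELECT *
FROM frequencies
WHERE frequency > 1
"""


def find_duplicates(transaction_ids):
    """Load the ids into an in-memory table and run the duplicate query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE dim_gift (transaction_id INTEGER)")
        conn.executemany(
            "INSERT INTO dim_gift (transaction_id) VALUES (?)",
            [(tid,) for tid in transaction_ids],
        )
        return conn.execute(DUPLICATES_QUERY).fetchall()
    finally:
        conn.close()


# No repeated ids: the query returns no rows, so there is nothing to sample.
unique_rows = find_duplicates([1, 2, 3, None])
# One repeated id: the query returns that id with its frequency.
duplicate_rows = find_duplicates([1, 2, 2, 3])
```

The empty result for unique ids corresponds to the case that triggered the empty upload.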

albinkjellin avatar May 24 '22 15:05 albinkjellin

I already have a PR for:

No fileId received in response: {'code': 'invalid_empty_upload', 'message': 'File uploads may not be empty'}

I'm not sure whether that fix also covers:

Error while executing Soda Cloud command response code: 400
{
  "code": "invalid_request",
  "message": "Failed request validation on the following properties:\nchecks[1].diagnostics.failedRowsFile.reference: may not be null\nchecks[1].diagnostics.failedRowsFile.reference: may not be null"
}

I'll check that later

@albinkjellin Are these also things I should look into?

Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types
Open Telemetry: Skipping non-soda span 'BigQuery.job.begin'.
Open Telemetry: Skipping non-soda span 'BigQuery.getQueryResults'.
Open Telemetry: Skipping non-soda span 'BigQuery.job.begin'.
Open Telemetry: Skipping non-soda span 'BigQuery.getQueryResults'.
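
The `Invalid type NoneType for attribute value` lines look like the warning OpenTelemetry logs when a span attribute is set to `None`, though the thread does not confirm the source. A hedged sketch of filtering attributes before they are set, using a hypothetical `clean_attributes` helper and made-up attribute names (scalar values only; sequences are not handled here):

```python
from typing import Any, Dict

# Scalar types listed as accepted in the warning text from the log.
VALID_ATTRIBUTE_TYPES = (bool, str, bytes, int, float)


def clean_attributes(attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the attributes whose values have an accepted scalar type."""
    return {
        key: value
        for key, value in attributes.items()
        if isinstance(value, VALID_ATTRIBUTE_TYPES)
    }


# Illustrative input: the None-valued entry is dropped.
cleaned = clean_attributes({"datasource": "dwh_2020", "schema": None, "row_count": 3})
```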

tombaeyens avatar May 25 '22 11:05 tombaeyens

Note to self: use tests/integration/test_samples_integration.py to try and reproduce the

Error while executing Soda Cloud command response code: 400
{
  "code": "invalid_request",
  "message": "Failed request validation on the following properties:\nchecks[1].diagnostics.failedRowsFile.reference: may not be null\nchecks[1].diagnostics.failedRowsFile.reference: may not be null"
}
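
The 400 response indicates that `failedRowsFile.reference` was sent as null for a check whose sample upload failed. One possible approach, sketched under assumptions, is to leave `failedRowsFile` out of the diagnostics when no reference exists. The `build_check_diagnostics` helper and the `value` key are hypothetical; only `diagnostics`, `failedRowsFile` and `reference` come from the error message:

```python
from typing import Any, Dict, Optional


def build_check_diagnostics(
    check_value: Any,
    failed_rows_file_ref: Optional[str],
) -> Dict[str, Any]:
    """Build a diagnostics dict, adding failedRowsFile only when a reference exists."""
    # "value" is an assumed key name for the check result.
    diagnostics: Dict[str, Any] = {"value": check_value}
    if failed_rows_file_ref is not None:
        # Only include the file entry when the upload returned a reference.
        diagnostics["failedRowsFile"] = {"reference": failed_rows_file_ref}
    return diagnostics
```

Whether Soda Cloud accepts a check without `failedRowsFile` is not stated in the thread, so this would need to be verified against the API.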

tombaeyens avatar May 25 '22 11:05 tombaeyens

Thanks for the quick response on this! The warning below is not as urgent:

Invalid type NoneType for attribute value. Expected one of ['bool', 'str', 'bytes', 'int', 'float'] or a sequence of those types

albinkjellin avatar May 25 '22 11:05 albinkjellin

@albinkjellin I couldn't reproduce the "Invalid type NoneType for attribute value" warning. Could you provide the configuration.yml?

vijaykiran avatar May 27 '22 14:05 vijaykiran

@albinkjellin Can you please test with 3.0.12 and re-open if this is still not working?

vijaykiran avatar Nov 03 '22 10:11 vijaykiran