great_expectations icon indicating copy to clipboard operation
great_expectations copied to clipboard

multiple sample_using methods fail for BigQuery datasource

Open rbertsche opened this issue 1 year ago • 3 comments

Describe the bug When using the sampling_methods:

  • sample_using_random &
  • sample_using_mod &
  • sample_using_md5 using a BigQuery datasource, the DataAssistant fails with errors. This occurs because the SQL code generated when using those sampling methods includes the SQL functions random() and mod(). Those functions are not valid functions in the BigQuery SQL dialect.

sample_using_random

There was a previous issue opened for this, but it has since been closed without resolution: https://github.com/great-expectations/great_expectations/issues/4429 i.e. This is the SQL generated by the Data Assistant. You can see that :

SELECT count(*) AS `table_row_count` 
FROM (SELECT * 
FROM `bq_project.schema.table` 
WHERE EXTRACT(year FROM `_date`) = @`param_1` AND EXTRACT(month FROM `_date`) = @`param_2` AND EXTRACT(day FROM `_date`) = @`param_3` ORDER BY random()
 LIMIT @`param_4`) AS `great_expectations_sub_selection`

BQ throws the following Error message:

Function not found: random; Did you mean rand?

Based on this information, it seems this can be fixed by recognizing when the sample method is using BQ, and pass in a BQ dialect specific rand function, and not the generic SQL random().

The line of code where this issue is occurring is: https://github.com/great-expectations/great_expectations/blob/b3761729b156367229b5cd8895d225cb13d3267a/great_expectations/execution_engine/split_and_sample/sqlalchemy_data_sampler.py#L174

Using mod

There are similar SQL dialect issues when usingsample_using_mod. This method injects the mod symbol % into the SQL. This is not a supported function in the BQ dialect of SQL. B! requires mod to use the function MOD(param1, param2)

Using md5

There are a couple of errors with the md5 function. Firstly, the doc describes the function name as sample_using_hash. However, that is not a valid function for SQLAlchemy based runs. Instead, the correct name in sample_using_md5. I believe the correct action would be to change the function name to sample_using_hash to align with the other execution engines, as well as the documentation.

There is also an issue with execution on BQ. The problem lies in the generated function. In this example, I replaced the hash_digits = 1, and hash_value = 'f'. I manually replaced them in the query for clarity:

SELECT count(*) AS `table_row_count` 
FROM (SELECT * 
FROM `project.schema.table` 
WHERE EXTRACT(year FROM `_date`) = @`param_1` AND EXTRACT(month FROM `_date`) = @`param_2` AND EXTRACT(day FROM `_date`) = @`param_3` AND right(md5(CAST(`column_name` AS STRING)), 1) = 'f') AS `great_expectations_sub_selection`

The error thrown is: No matching signature for operator = for argument types: BYTES, STRING. STRING and BYTES are different types that are not directly comparable.

This error is directly related to the return types of BQ in certain functions. The md5() function returns type BYTES, and the right() method will return type BYTES or STRING depending on the input. Since md5() always returns BYTES, then right() always returns BYTES. The 'f' parameter that is being passed into the query is of type STRING. BigQuery does not allow = comparisons between BYTES and STRINGS. Therefore, the query returns an error. https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions#md5 https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#right

Fixing this error may be possible by wrapping the right statement in a cast to string function or the hash_value into bytes. https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_functions#cast_as_bytes

To Reproduce Steps to reproduce the behavior:

  1. Create a BQ datasource connection that uses the sampling_method: sample_using_random
  2. Run the DataAssistant on that new Data Connection
  3. See error

Expected behavior This sample_using_random should recognize that this is a Bigquery connection, and substitute the appropriate random function for the bq dialect. In this case it is rand. If that were done, the DataAssistant would run and execute correctly.

This doesn't just apply to the DataAssistant, but all expectations run on a sample_using_random configured BigQuery Dataset.

Environment (please complete the following information):

  • Operating System: Debian GNU/Linux 11 (bullseye)
  • Great Expectations Version: 0.15.25
  • Python Version: Python 3.9.14
  • sqlalchemy-bigquery==1.4.3

Additional context Add any other context about the problem here.

rbertsche avatar Oct 05 '22 21:10 rbertsche

Howdy @rbertsche :wave: thank you for raising this with us and being a part of our Community :bow:

We'll review this wiht the team, thank you so very much for bringing this to our attention and your patience

AFineDayFor avatar Oct 12 '22 15:10 AFineDayFor