great_expectations
great_expectations copied to clipboard
multiple sample_using methods fail for BigQuery datasource
Describe the bug When using the sampling_methods:
-
sample_using_random
& -
sample_using_mod
& -
sample_using_md5
using a BigQuery datasource, the DataAssistant fails with errors. This occurs because the SQL code generated when using those sampling methods includes the SQL functionsrandom()
andmod()
. Those functions are not valid functions in the BigQuery SQL dialect.
sample_using_random
There was a previous issue opened for this, but it has since been closed without resolution: https://github.com/great-expectations/great_expectations/issues/4429 i.e. This is the SQL generated by the Data Assistant. You can see that :
SELECT count(*) AS `table_row_count`
FROM (SELECT *
FROM `bq_project.schema.table`
WHERE EXTRACT(year FROM `_date`) = @`param_1` AND EXTRACT(month FROM `_date`) = @`param_2` AND EXTRACT(day FROM `_date`) = @`param_3` ORDER BY random()
LIMIT @`param_4`) AS `great_expectations_sub_selection`
BQ throws the following Error message:
Function not found: random; Did you mean rand?
Based on this information, it seems this can be fixed by recognizing when the sample method is using BQ, and pass in a BQ dialect specific rand
function, and not the generic SQL random()
.
The line of code where this issue is occurring is: https://github.com/great-expectations/great_expectations/blob/b3761729b156367229b5cd8895d225cb13d3267a/great_expectations/execution_engine/split_and_sample/sqlalchemy_data_sampler.py#L174
Using mod
There are similar SQL dialect issues when usingsample_using_mod
. This method injects the mod symbol %
into the SQL. This is not a supported function in the BQ dialect of SQL. B! requires mod to use the function MOD(param1, param2)
Using md5
There are a couple of errors with the md5 function. Firstly, the doc describes the function name as sample_using_hash
. However, that is not a valid function for SQLAlchemy based runs. Instead, the correct name in sample_using_md5
. I believe the correct action would be to change the function name to sample_using_hash
to align with the other execution engines, as well as the documentation.
There is also an issue with execution on BQ. The problem lies in the generated function. In this example, I replaced the hash_digits = 1, and hash_value = 'f'. I manually replaced them in the query for clarity:
SELECT count(*) AS `table_row_count`
FROM (SELECT *
FROM `project.schema.table`
WHERE EXTRACT(year FROM `_date`) = @`param_1` AND EXTRACT(month FROM `_date`) = @`param_2` AND EXTRACT(day FROM `_date`) = @`param_3` AND right(md5(CAST(`column_name` AS STRING)), 1) = 'f') AS `great_expectations_sub_selection`
The error thrown is:
No matching signature for operator = for argument types: BYTES, STRING. STRING and BYTES are different types that are not directly comparable.
This error is directly related to the return types of BQ in certain functions. The md5()
function returns type BYTES, and the right()
method will return type BYTES
or STRING
depending on the input. Since md5()
always returns BYTES
, then right()
always returns BYTES
. The 'f' parameter that is being passed into the query is of type STRING
. BigQuery does not allow =
comparisons between BYTES
and STRINGS
. Therefore, the query returns an error.
https://cloud.google.com/bigquery/docs/reference/standard-sql/hash_functions#md5
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#right
Fixing this error may be possible by wrapping the right
statement in a cast to string function or the hash_value
into bytes.
https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_functions#cast_as_bytes
To Reproduce Steps to reproduce the behavior:
- Create a BQ datasource connection that uses the
sampling_method
:sample_using_random
- Run the DataAssistant on that new Data Connection
- See error
Expected behavior
This sample_using_random should recognize that this is a Bigquery connection, and substitute the appropriate random
function for the bq dialect. In this case it is rand
. If that were done, the DataAssistant would run and execute correctly.
This doesn't just apply to the DataAssistant, but all expectations run on a sample_using_random
configured BigQuery Dataset.
Environment (please complete the following information):
- Operating System:
Debian GNU/Linux 11 (bullseye)
- Great Expectations Version:
0.15.25
- Python Version:
Python 3.9.14
-
sqlalchemy-bigquery==1.4.3
Additional context Add any other context about the problem here.
Howdy @rbertsche :wave: thank you for raising this with us and being a part of our Community :bow:
We'll review this wiht the team, thank you so very much for bringing this to our attention and your patience