[feature] substitution syntax is not working for table_name attribute of fluent_datasources (in Athena)
Describe the bug
When using substitution syntax (e.g. ${my_value}) in the table_name attribute of a data asset in great_expectations.yml, the value is not substituted from the environment variable or from uncommitted/config_variables.yml. Instead, the literal string is used as the table name when running queries against this data asset: select count(*) from ${table_name}.
To Reproduce
great_expectations.yml config:
# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
# - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
# - Join our slack channel: http://greatexpectations.io/slack
# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3.0
# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources: {}
# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GX encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials
config_variables_file_path: uncommitted/config_variables.yml
# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/
stores:
  # Stores are configurable places to store things like Expectations, Validations
  # Data Docs, and more. These are for advanced users only - most users can simply
  # leave this section alone.
  #
  # Three stores are required: expectations, validations, and
  # evaluation_parameters, and must exist with a valid store entry. Additional
  # stores can be configured for uses such as data_docs, etc.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
  evaluation_parameter_store:
    class_name: EvaluationParameterStore
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/
  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/
expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store
data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. They are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
anonymous_usage_statistics:
  enabled: true
  data_context_id: 89ae52ad-b736-43d8-a4b5-2a91d10041a5
fluent_datasources:
  my_datasource:
    type: sql
    assets:
      my_asset:
        type: table
        order_by: []
        batch_metadata: {}
        table_name: ${TABLE_PREFIX}_my_table
        schema_name:
    connection_string: awsathena+rest://@athena.${REGION}.amazonaws.com/{SCHEMA_NAME}?s3_staging_dir=s3://${S3_BUCKET_NAME}/{SCHEMA_NAME}/staging/
notebooks:
include_rendered_content:
  globally: false
  expectation_validation_result: false
  expectation_suite: false
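For reference, the corresponding uncommitted/config_variables.yml would look something like the following (the values shown here are hypothetical placeholders, not from the actual report). Per the comments in great_expectations.yml above, environment variables with the same names take precedence over these values.

```yaml
# uncommitted/config_variables.yml (hypothetical example values)
TABLE_PREFIX: prod
REGION: us-east-1
S3_BUCKET_NAME: my-athena-results-bucket
```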
The executed code:
import great_expectations as gx
context = gx.get_context(context_root_dir=GX_ROOT_DIR)
result = context.run_checkpoint(checkpoint_name="my_checkpoint")
Full stack trace:
An SQL execution Exception occurred. OperationalError: "(pyathena.error.OperationalError) TABLE_NOT_FOUND: line 3:6: Table 'awsdatacatalog.my_schema.${TABLE_PREFIX}_my_table' does not exist
[SQL: SELECT count(*) AS "table.row_count"
FROM (SELECT *
FROM "${TABLE_PREFIX}_my_table"
WHERE true) AS anon_1]
(Background on this error at: https://sqlalche.me/e/20/e3q8)".
Traceback: "Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1965, in _exec_single_context
self.dialect.do_execute(
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 921, in do_execute
cursor.execute(statement, parameters)
File "/usr/local/lib/python3.9/site-packages/pyathena/cursor.py", line 121, in execute
raise OperationalError(query_execution.state_change_reason)
pyathena.error.OperationalError: TABLE_NOT_FOUND: line 3:6: Table 'awsdatacatalog.my_schema.${TABLE_PREFIX}_my_table' does not exist
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/great_expectations/execution_engine/sqlalchemy_execution_engine.py", line 1112, in resolve_metric_bundle
res = self.execute_query(sa_query_object).fetchall()
File "/usr/local/lib/python3.9/site-packages/great_expectations/execution_engine/sqlalchemy_execution_engine.py", line 1415, in execute_query
result = connection.execute(query)
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1412, in execute
return meth(
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 483, in _execute_on_connection
return connection._execute_clauseelement(
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1635, in _execute_clauseelement
ret = self._execute_context(
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1844, in _execute_context
return self._exec_single_context(
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1984, in _exec_single_context
self._handle_dbapi_exception(
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2339, in _handle_dbapi_exception
raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1965, in _exec_single_context
self.dialect.do_execute(
File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 921, in do_execute
cursor.execute(statement, parameters)
File "/usr/local/lib/python3.9/site-packages/pyathena/cursor.py", line 121, in execute
raise OperationalError(query_execution.state_change_reason)
sqlalchemy.exc.OperationalError: (pyathena.error.OperationalError) TABLE_NOT_FOUND: line 3:6: Table 'awsdatacatalog.my_schema.${TABLE_PREFIX}_my_table' does not exist
[SQL: SELECT count(*) AS "table.row_count"
FROM (SELECT *
FROM "${TABLE_PREFIX}_my_table"
WHERE true) AS anon_1]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Expected behavior
The ${TABLE_PREFIX} should have been replaced with the value stored in uncommitted/config_variables.yml or with the value of the TABLE_PREFIX environment variable. The substitution works in all other places in my great_expectations.yml file (e.g. connection_string).
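The expected resolution can be illustrated with a minimal stand-in resolver. This is a sketch of the documented precedence (environment variable first, then config_variables.yml), not GX's actual implementation; the substitute helper is hypothetical.

```python
import os
import re

def substitute(value: str, config_variables: dict) -> str:
    """Replace ${VAR} tokens, preferring environment variables over
    values from config_variables.yml (matching GX's documented precedence).
    Tokens that resolve nowhere are left untouched."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        return os.environ.get(name, str(config_variables.get(name, match.group(0))))
    return re.sub(r"\$\{([^}]+)\}", repl, value)

# Assuming TABLE_PREFIX is not set in the environment, the config value wins:
print(substitute("${TABLE_PREFIX}_my_table", {"TABLE_PREFIX": "prod"}))
```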
Environment (please complete the following information):
- Operating System: Linux
- Great Expectations Version: 0.17.1
- Datasource: Athena
- Cloud environment: Airflow, AWS
Hey @hagai-arad ! Thanks for raising this. We'll investigate what may be happening here.
This is working as intended.
We don't support config substitution for arbitrary config fields in the Fluent Datasources. We only do it for things like passwords/credentials, connection_string, etc.
I don't think this is well-documented.
This might be something we could add support for as part of a new feature request.
@Kilo59 Thanks. Should I open a new feature request? Not sure how it works
@hagai-arad yes, sorry for the slow response. Or you could edit this issue and title to highlight that it's a feature request.
It should be simple to implement. If you're interested, I can guide you to implement this yourself.
@Kilo59 Hey, I would like to get your guidance. How can we proceed from here? Thanks!
Sorry again @hagai-arad I missed the notification that you responded.
Our Datasource and DataAsset classes are pydantic models.
You'll need to update the annotation on the TableAsset.name field to be Union[ConfigStr, str]. This will cause pydantic to coerce the value to a ConfigStr, and only read it as a normal string if it cannot (i.e. there is no ${} syntax).
https://docs.pydantic.dev/1.10/usage/types/#unions
https://github.com/great-expectations/great_expectations/blob/25dd66ea30d865f0c994ac50e183bef60d70c3cc/great_expectations/datasource/fluent/sql_datasource.py#L837
That should be enough by itself to get your desired behavior.
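For anyone following along, the left-to-right Union coercion described above can be sketched with a self-contained toy. Note that ConfigStr here is a simplified stand-in for GX's internal type, and coerce_table_name is a hypothetical helper that mimics what pydantic does for a Union[ConfigStr, str] field; it is not the real GX code path.

```python
import re

class ConfigStr(str):
    """Toy stand-in for GX's ConfigStr: a string subclass that only
    validates when the value contains ${VAR} or $VAR substitution syntax."""
    _pattern = re.compile(r"\$\{[^}]+\}|\$[A-Za-z_][A-Za-z0-9_]*")

    @classmethod
    def validate(cls, value: str) -> "ConfigStr":
        if not cls._pattern.search(value):
            raise ValueError("no substitution syntax found")
        return cls(value)

def coerce_table_name(value: str) -> str:
    """Mimic pydantic's left-to-right Union[ConfigStr, str] coercion:
    try ConfigStr first, fall back to a plain str if validation fails."""
    try:
        return ConfigStr.validate(value)
    except ValueError:
        return str(value)

print(type(coerce_table_name("${TABLE_PREFIX}_my_table")).__name__)  # ConfigStr
print(type(coerce_table_name("my_table")).__name__)                  # str
```

With this coercion in place, a table_name containing ${} syntax would be recognized as a config value and substituted, while ordinary table names continue to behave as plain strings.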
Hi @Kilo59, Is this issue still up for grabs? If so I'd be happy to give it a go.
I have a similar use case where I want to replace the project_id of a query with a value in the config.yaml:
bigquery_datasource:
  type: sql
  assets:
    customer:
      type: query
      order_by: []
      batch_metadata: {}
      query: select * from ${project_id}.dataset.customer
From what I can tell, I would have to make the same annotation change you suggest above, here: https://github.com/great-expectations/great_expectations/blob/25dd66ea30d865f0c994ac50e183bef60d70c3cc/great_expectations/datasource/fluent/sql_datasource.py#L805
Hello @hagai-arad. With the upcoming launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.
To get started on your transition to GX Core, check out the GX Core quickstart (click “Full example code” tab to see a code example).
You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.
Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗