
[feature] substitution syntax is not working for table_name attribute of fluent_datasources (in Athena)

Open hagai-arad opened this issue 1 year ago • 7 comments

Describe the bug When using substitution syntax in great_expectations.yml (e.g. ${my_value}) in the table_name attribute of a data asset, the value is not substituted with the env var value or with the value given in uncommitted/config_variables.yml. Instead, the literal string itself is used as the table name when running queries against this data asset: select count(*) from ${table_name}.
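For context, the substitution behavior the config file documents (environment variables take precedence over config_variables.yml entries) can be sketched as follows. This is an illustrative pure-Python sketch, not GX's actual implementation; the `substitute` function and the `"prod"` value are hypothetical:

```python
import os
import re

def substitute(value: str, config_variables: dict) -> str:
    """Replace ${VAR} (or $VAR) tokens, preferring environment
    variables over config_variables.yml entries. Illustrative
    sketch only -- not Great Expectations' implementation."""
    def _lookup(match: re.Match) -> str:
        name = match.group(1) or match.group(2)
        # Env var first, then config_variables, else leave the token as-is.
        return os.environ.get(name, config_variables.get(name, match.group(0)))
    return re.sub(r"\$\{([^}]+)\}|\$(\w+)", _lookup, value)

# With TABLE_PREFIX unset in the environment and defined in
# config_variables.yml as "prod" (hypothetical value):
print(substitute("${TABLE_PREFIX}_my_table", {"TABLE_PREFIX": "prod"}))
```

This is the behavior the reporter expected for table_name; as the stack trace below shows, the token is instead passed through to Athena verbatim.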

To Reproduce great_expectations.yml config:

# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3.0

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources: {}

# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GX encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials
  
config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
#
# Three stores are required: expectations, validations, and
# evaluation_parameters, and must exist with a valid store entry. Additional
# stores can be configured for uses such as data_docs, etc.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  evaluation_parameter_store:
    class_name: EvaluationParameterStore
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. They are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

anonymous_usage_statistics:
  enabled: true
  data_context_id: 89ae52ad-b736-43d8-a4b5-2a91d10041a5
fluent_datasources:
  my_datasource:
    type: sql
    assets:
      my_asset:
        type: table
        order_by: []
        batch_metadata: {}
        table_name: ${TABLE_PREFIX}_my_table
        schema_name:
    connection_string: awsathena+rest://@athena.${REGION}.amazonaws.com/{SCHEMA_NAME}?s3_staging_dir=s3://${S3_BUCKET_NAME}/{SCHEMA_NAME}/staging/
notebooks:
include_rendered_content:
  globally: false
  expectation_validation_result: false
  expectation_suite: false

The executed code:

import great_expectations as gx

context = gx.get_context(context_root_dir=GX_ROOT_DIR)

result = context.run_checkpoint(checkpoint_name="my_checkpoint")

Full stack trace:

An SQL execution Exception occurred.  OperationalError: "(pyathena.error.OperationalError) TABLE_NOT_FOUND: line 3:6: Table 'awsdatacatalog.my_schema.${TABLE_PREFIX}_my_table' does not exist
[SQL: SELECT count(*) AS "table.row_count"
FROM (SELECT *
FROM "${TABLE_PREFIX}_my_table"
WHERE true) AS anon_1]
(Background on this error at: https://sqlalche.me/e/20/e3q8)".  
Traceback: "Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1965, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 921, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.9/site-packages/pyathena/cursor.py", line 121, in execute
    raise OperationalError(query_execution.state_change_reason)
pyathena.error.OperationalError: TABLE_NOT_FOUND: line 3:6: Table 'awsdatacatalog.my_schema.${TABLE_PREFIX}_my_table' does not exist

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/great_expectations/execution_engine/sqlalchemy_execution_engine.py", line 1112, in resolve_metric_bundle
    res = self.execute_query(sa_query_object).fetchall()
  File "/usr/local/lib/python3.9/site-packages/great_expectations/execution_engine/sqlalchemy_execution_engine.py", line 1415, in execute_query
    result = connection.execute(query)
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1412, in execute
    return meth(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 483, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1635, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1844, in _execute_context
    return self._exec_single_context(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1984, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2339, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1965, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 921, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/local/lib/python3.9/site-packages/pyathena/cursor.py", line 121, in execute
    raise OperationalError(query_execution.state_change_reason)
sqlalchemy.exc.OperationalError: (pyathena.error.OperationalError) TABLE_NOT_FOUND: line 3:6: Table 'awsdatacatalog.my_schema.${TABLE_PREFIX}_my_table' does not exist
[SQL: SELECT count(*) AS "table.row_count"
FROM (SELECT *
FROM "${TABLE_PREFIX}_my_table"
WHERE true) AS anon_1]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

Expected behavior The ${TABLE_PREFIX} should have been replaced with the value stored in uncommitted/config_variables.yml or with the value of the TABLE_PREFIX env var. The substitution works everywhere else in my great_expectations.yml file (e.g. connection_string).
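For completeness, the variable could be supplied in uncommitted/config_variables.yml like this (the value `prod` is hypothetical):

```yaml
# uncommitted/config_variables.yml
TABLE_PREFIX: prod
```

or via an exported TABLE_PREFIX environment variable, which would take precedence over the config file entry.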

Environment (please complete the following information):

  • Operating System: Linux
  • Great Expectations Version: 0.17.1
  • Datasource: Athena
  • Cloud environment: Airflow, AWS

hagai-arad avatar Jul 17 '23 15:07 hagai-arad

Hey @hagai-arad ! Thanks for raising this. We'll investigate what may be happening here.

austiezr avatar Jul 18 '23 17:07 austiezr

This is working as intended. We don't support config substitution for arbitrary config fields of the Fluent Datasources; we only do it for fields like passwords/credentials, connection_string, etc.

I don't think this is well-documented.

This might be something we could add support for as part of a new feature request.

Kilo59 avatar Aug 01 '23 02:08 Kilo59

@Kilo59 Thanks. Should I open a new feature request? Not sure how it works

hagai-arad avatar Aug 01 '23 07:08 hagai-arad

@hagai-arad yes, sorry for the slow response. Alternatively, you could edit this issue and its title to highlight that it's a feature request.

It should be simple to implement. If you're interested, I can guide you to implement this yourself.

Kilo59 avatar Aug 15 '23 12:08 Kilo59

@Kilo59 Hey, I would like to get your guidance. How can we proceed from here? Thanks!

hagai-arad avatar Aug 16 '23 08:08 hagai-arad

Sorry again @hagai-arad I missed the notification that you responded.

Our Datasource and DataAsset classes are pydantic models. You'll need to update the annotation on the TableAsset.name field to be Union[ConfigStr, str]. This will cause pydantic to coerce the value to a ConfigStr and, only if it cannot (there's no ${} syntax), read it as a normal string.

https://docs.pydantic.dev/1.10/usage/types/#unions

https://github.com/great-expectations/great_expectations/blob/25dd66ea30d865f0c994ac50e183bef60d70c3cc/great_expectations/datasource/fluent/sql_datasource.py#L837

That should be enough by itself to get your desired behavior.
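The coercion order Kilo59 describes can be sketched in pure Python (no pydantic dependency). Here `ConfigStr` is a stand-in for GX's actual ConfigStr class, and `coerce_table_name` is a hypothetical helper mimicking what pydantic does when validating Union[ConfigStr, str] left to right:

```python
import re

class ConfigStr(str):
    """Stand-in for GX's ConfigStr: a str subclass that only accepts
    values containing ${...} or $VAR substitution syntax."""
    _pattern = re.compile(r"\$\{.+?\}|\$\w+")

    def __new__(cls, value: str) -> "ConfigStr":
        if not cls._pattern.search(value):
            raise ValueError("no substitution syntax found")
        return super().__new__(cls, value)

def coerce_table_name(value: str):
    """Mimic pydantic validating Union[ConfigStr, str]: try ConfigStr
    first, fall back to a plain str when validation fails."""
    try:
        return ConfigStr(value)
    except ValueError:
        return str(value)

print(type(coerce_table_name("${TABLE_PREFIX}_my_table")).__name__)  # ConfigStr
print(type(coerce_table_name("my_table")).__name__)  # str
```

Values carrying substitution syntax would then be recognized as ConfigStr and substituted before the query is built, while plain table names pass through unchanged.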

Kilo59 avatar Aug 29 '23 14:08 Kilo59

Hi @Kilo59, Is this issue still up for grabs? If so I'd be happy to give it a go.

I have a similar use case where I want to replace the project_id of a query with a value in the config.yaml:

  bigquery_datasource:
    type: sql
    assets:
      customer:
        type: query
        order_by: []
        batch_metadata: {}
        query: select * from ${project_id}.dataset.customer

From what I can tell, I would have to make the same annotation change you suggested above, here: https://github.com/great-expectations/great_expectations/blob/25dd66ea30d865f0c994ac50e183bef60d70c3cc/great_expectations/datasource/fluent/sql_datasource.py#L805

slocoro avatar Feb 23 '24 14:02 slocoro