
Potential Regression in 0.15.16

Open rdodev opened this issue 1 year ago • 2 comments

We've seen a couple of reported issues that started happening in v0.15.16 and were fixed when users were told to use an earlier version.

Relevant Slack Threads:

  • https://greatexpectationstalk.slack.com/archives/CUTCNHN82/p1659468256392659
  • https://greatexpectationstalk.slack.com/archives/CUTCNHN82/p1659082016455689

rdodev avatar Aug 03 '22 12:08 rdodev

As per the Slack request, here's my great_expectations.yml (execution environment: AWS Glue):


# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources:
  my_spark_dataframe:
    data_connectors:
      default_runtime_data_connector_name:
        batch_identifiers:
          - batch_id
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      class_name: SparkDFExecutionEngine
      module_name: great_expectations.execution_engine
      force_reuse_spark_context: true

# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GE encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials


config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
#plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
#
# Three stores are required: expectations, validations, and
# evaluation_parameters, and must exist with a valid store entry. Additional
# stores can be configured for uses such as data_docs, etc.


  expectations_S3_store:
    class_name: ExpectationsStore
    store_backend:
        class_name: TupleS3StoreBackend
        bucket: aws-glue-scripts-${AWS_ACCOUNT_ID}-${AWS_ACCOUNT_REGION}
        prefix: validation/great_expectations/expectations

  validations_S3_store:
    class_name: ValidationsStore
    store_backend:
        class_name: TupleS3StoreBackend
        bucket: aws-glue-scripts-${AWS_ACCOUNT_ID}-${AWS_ACCOUNT_REGION}
        prefix: validation/great_expectations/validations

  evaluation_parameter_store:
    # Evaluation Parameters enable dynamic expectations. Read more here:
    # https://docs.greatexpectations.io/docs/reference/evaluation_parameters/
    class_name: EvaluationParameterStore


  checkpoint_S3_store:
    class_name: CheckpointStore
    store_backend:
        class_name: TupleS3StoreBackend
        bucket: aws-glue-scripts-${AWS_ACCOUNT_ID}-${AWS_ACCOUNT_REGION}
        prefix: validation/great_expectations/checkpoints/

  profiler_S3_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleS3StoreBackend
      suppress_store_backend_id: true
      bucket: aws-glue-scripts-${AWS_ACCOUNT_ID}-${AWS_ACCOUNT_REGION}
      prefix: validation/great_expectations/profilers/

expectations_store_name: expectations_S3_store
validations_store_name: validations_S3_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_S3_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. They are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    # set to false to hide how-to buttons in Data Docs
    show_how_to_buttons: false
    store_backend:
        class_name: TupleS3StoreBackend
        bucket: company-data-val-docs-${AWS_ACCOUNT_ID}
    site_index_builder:
        class_name: DefaultSiteIndexBuilder

anonymous_usage_statistics:
  enabled: True

and here's my setup script:

# Imports this snippet assumes (etl_utils and the ge_expectations_yml_s3_* values
# are project-specific and defined elsewhere):
import os

import boto3
import great_expectations as ge
from ruamel import yaml
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig


def setup_runtime_data_context(local=False):
    if local:
        return ge.get_context()
    s3 = boto3.client("s3")
    os.environ["AWS_ACCOUNT_ID"] = etl_utils.DbInfo._get_account_id()
    os.environ["AWS_ACCOUNT_REGION"] = etl_utils.DbInfo._get_region()
    great_expectations_yml = s3.get_object(
        Bucket=ge_expectations_yml_s3_bucket, Key=ge_expectations_yml_s3_key)["Body"]
    yml = yaml.YAML()
    cfg = yml.load(great_expectations_yml)
    ds_cfg = DatasourceConfig(
        **cfg.get("datasources").get("my_spark_dataframe"))
    dc_cfg = DataContextConfig(
        datasources={"my_spark_dataframe": ds_cfg}, stores=cfg.get("stores"))
    cfg.pop("datasources")
    cfg.pop("stores")
    dc_cfg.__dict__.update(cfg)
    return BaseDataContext(project_config=dc_cfg)

and here's the main script (I omitted the Spark dataframe creation):

        # `RuntimeBatchRequest` is imported from great_expectations.core.batch;
        # `context` is the data context returned by setup_runtime_data_context(),
        # and `df` is the omitted Spark dataframe.
        batch_request = RuntimeBatchRequest(
            datasource_name="my_spark_dataframe",
            data_connector_name="default_runtime_data_connector_name",
            data_asset_name=query,
            batch_identifiers={"batch_id": "default_identifier"},
            runtime_parameters={"batch_data": df})

        context.run_checkpoint(
            checkpoint_name="daily_redshift_validation",
            batch_request=batch_request,
            expectation_suite_name=exp_suite,
            run_name=t_name
        )
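(For context on the run_checkpoint call above: the actual daily_redshift_validation checkpoint lives in the S3 checkpoint store configured in great_expectations.yml and is not shown in this issue. Purely as a hedged sketch, using only the names already quoted above, a minimal checkpoint of this kind could look something like the following, with batch_request, expectation_suite_name and run_name supplied at run time by run_checkpoint:)

name: daily_redshift_validation
config_version: 1.0
class_name: Checkpoint
module_name: great_expectations.checkpoint
# batch_request, expectation_suite_name and run_name are passed in by
# context.run_checkpoint(...) above, so no static validations are listed here
validations: []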

As I said, I had to revert to 0.15.13 in order to make it work again; on later versions it reports this error:

return BaseDataContext(project_config=dc_cfg)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/core/usage_statistics/usage_statistics.py", line 294, in usage_statistics_wrapped_method
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/base_data_context.py", line 297, in __init__
    project_config=project_config, runtime_environment=runtime_environment
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/ephemeral_data_context.py", line 46, in __init__
    super().__init__(runtime_environment=runtime_environment)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/abstract_data_context.py", line 178, in __init__
    self._init_stores(self.project_config_with_variables_substituted.stores)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/abstract_data_context.py", line 1416, in _init_stores
    self._build_store_from_config(store_name, store_config)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/abstract_data_context.py", line 1346, in _build_store_from_config
    "manually_initialize_store_backend_id": self.variables.anonymous_usage_statistics.data_context_id
AttributeError: 'dict' object has no attribute 'data_context_id'

planck-length avatar Aug 05 '22 17:08 planck-length

Hi @planck-length, I'm Will with the OSS team at Great Expectations. I really appreciate you sharing your configuration; it's really helped with the investigation of this issue.

So the cause of the error has to do with some additional checks we have been doing for DataContext, particularly around configurations that are now stored as DataContextVariables.

What that means is that the __dict__.update(cfg) at the end of the setup script prevents some of the config objects, like AnonymizedUsageStatisticsConfig, from being instantiated properly, hence the AttributeError.
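To illustrate the difference (a minimal sketch, not the actual GE internals; it assumes only the AnonymizedUsageStatisticsConfig class from great_expectations.data_context.types.base):

from great_expectations.data_context.types.base import AnonymizedUsageStatisticsConfig

# The typed config object carries a data_context_id, which store initialization reads.
typed_section = AnonymizedUsageStatisticsConfig(enabled=True)
print(typed_section.data_context_id)             # a generated UUID

# A raw dict (what __dict__.update(cfg) leaves in place of the typed object) does not.
raw_section = {"enabled": True}
print(hasattr(raw_section, "data_context_id"))   # False -> AttributeError at store init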

Here is an updated setup_runtime_data_context() script, where we load the DataContextConfig directly from cfg and then replace the datasources entry with the one that we want (ds_cfg). Would you be able to give it a shot and let us know how it goes?

def updated_setup_runtime_data_context(local=False):
    if local:
        return ge.get_context()
    s3 = boto3.client("s3")
    os.environ["AWS_ACCOUNT_ID"] = etl_utils.DbInfo._get_account_id()
    os.environ["AWS_ACCOUNT_REGION"] = etl_utils.DbInfo._get_region()
    great_expectations_yml = s3.get_object(
        Bucket=ge_expectations_yml_s3_bucket, Key=ge_expectations_yml_s3_key)["Body"]
    yml = yaml.YAML()
    cfg = yml.load(great_expectations_yml)
    ds_cfg = DatasourceConfig(
        **cfg.get("datasources").get("my_spark_dataframe"))
    dc_cfg = DataContextConfig(**cfg)
    dc_cfg.datasources["my_spark_dataframe"] = ds_cfg
    return BaseDataContext(project_config=dc_cfg)
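The key difference from the original script is that every section of the YAML now goes through DataContextConfig's constructor, so nested sections such as anonymous_usage_statistics and stores are parsed into their typed config objects rather than being pasted into __dict__ as raw dicts; only the datasources entry is then swapped for the already-typed DatasourceConfig.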

Shinnnyshinshin avatar Aug 11 '22 03:08 Shinnnyshinshin

Hello @Shinnnyshinshin, I just copied your updated code and ran it on the upgraded version, and it didn't fail. I'll do a full check later today. Thank you for your help!

planck-length avatar Aug 12 '22 08:08 planck-length

Great to hear, @planck-length. Let us know how the full test goes :)

Shinnnyshinshin avatar Aug 12 '22 19:08 Shinnnyshinshin

Hi @planck-length, just wanted to check in on this issue. Did the script work with the full test?

Shinnnyshinshin avatar Aug 15 '22 16:08 Shinnnyshinshin

@Shinnnyshinshin hi, sorry for the delay in responding. Yes, the script ran all my suites with no failures during execution. Thanks again for your help!

planck-length avatar Aug 15 '22 17:08 planck-length

Wonderful :) Glad to help. I'll close this issue now.

Shinnnyshinshin avatar Aug 15 '22 17:08 Shinnnyshinshin