great_expectations
Potential Regression in 0.15.16
We've seen a couple of reported issues that started happening in v0.15.16 and were resolved when users reverted to an earlier version.
Relevant Slack Threads:
- https://greatexpectationstalk.slack.com/archives/CUTCNHN82/p1659468256392659
- https://greatexpectationstalk.slack.com/archives/CUTCNHN82/p1659082016455689
As per the Slack request, here's my great_expectations.yml (execution environment: AWS Glue):
```yaml
# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources:
  my_spark_dataframe:
    class_name: Datasource
    module_name: great_expectations.datasource
    data_connectors:
      default_runtime_data_connector_name:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers:
          - batch_id
    execution_engine:
      class_name: SparkDFExecutionEngine
      module_name: great_expectations.execution_engine
      force_reuse_spark_context: true

# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GE encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials
config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
# plugins_directory: plugins/

stores:
  # Stores are configurable places to store things like Expectations, Validations,
  # Data Docs, and more. These are for advanced users only - most users can simply
  # leave this section alone.
  #
  # Three stores are required: expectations, validations, and
  # evaluation_parameters, and must exist with a valid store entry. Additional
  # stores can be configured for uses such as data_docs, etc.
  expectations_S3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: aws-glue-scripts-${AWS_ACCOUNT_ID}-${AWS_ACCOUNT_REGION}
      prefix: validation/great_expectations/expectations
  validations_S3_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: aws-glue-scripts-${AWS_ACCOUNT_ID}-${AWS_ACCOUNT_REGION}
      prefix: validation/great_expectations/validations
  evaluation_parameter_store:
    # Evaluation Parameters enable dynamic expectations. Read more here:
    # https://docs.greatexpectations.io/docs/reference/evaluation_parameters/
    class_name: EvaluationParameterStore
  checkpoint_S3_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: aws-glue-scripts-${AWS_ACCOUNT_ID}-${AWS_ACCOUNT_REGION}
      prefix: validation/great_expectations/checkpoints/
  profiler_S3_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleS3StoreBackend
      suppress_store_backend_id: true
      bucket: aws-glue-scripts-${AWS_ACCOUNT_ID}-${AWS_ACCOUNT_REGION}
      prefix: validation/great_expectations/profilers/

expectations_store_name: expectations_S3_store
validations_store_name: validations_S3_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_S3_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. They are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    # set to false to hide how-to buttons in Data Docs
    show_how_to_buttons: false
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: company-data-val-docs-${AWS_ACCOUNT_ID}
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

anonymous_usage_statistics:
  enabled: True
```
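The `${AWS_ACCOUNT_ID}` and `${AWS_ACCOUNT_REGION}` placeholders above rely on the variable substitution described in the config comments. As a rough illustration only (this is not GE's actual implementation, and the values below are made up), the mechanism behaves much like Python's `string.Template` expansion against the environment:

```python
import os
from string import Template

# Hypothetical values standing in for the real account settings.
os.environ["AWS_ACCOUNT_ID"] = "123456789012"
os.environ["AWS_ACCOUNT_REGION"] = "us-east-1"

raw = "aws-glue-scripts-${AWS_ACCOUNT_ID}-${AWS_ACCOUNT_REGION}"

# Substitute ${VAR} placeholders from the environment, which is
# conceptually what GE does when it loads great_expectations.yml.
bucket = Template(raw).substitute(os.environ)
print(bucket)  # aws-glue-scripts-123456789012-us-east-1
```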
and here's my setup script:
```python
def setup_runtime_data_context(local=False):
    if local:
        return ge.get_context()
    s3 = boto3.client("s3")
    os.environ["AWS_ACCOUNT_ID"] = etl_utils.DbInfo._get_account_id()
    os.environ["AWS_ACCOUNT_REGION"] = etl_utils.DbInfo._get_region()
    great_expectations_yml = s3.get_object(
        Bucket=ge_expectations_yml_s3_bucket, Key=ge_expectations_yml_s3_key)["Body"]
    yml = yaml.YAML()
    cfg = yml.load(great_expectations_yml)
    ds_cfg = DatasourceConfig(
        **cfg.get("datasources").get("my_spark_dataframe"))
    dc_cfg = DataContextConfig(
        datasources={"my_spark_dataframe": ds_cfg}, stores=cfg.get("stores"))
    cfg.pop("datasources")
    cfg.pop("stores")
    dc_cfg.__dict__.update(cfg)
    return BaseDataContext(project_config=dc_cfg)
```
and here's the main script (I omitted the spark dataframe creation)
```python
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_dataframe",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name=query,
    batch_identifiers={"batch_id": "default_identifier"},
    runtime_parameters={"batch_data": df})

context.run_checkpoint(
    checkpoint_name="daily_redshift_validation",
    batch_request=batch_request,
    expectation_suite_name=exp_suite,
    run_name=t_name,
)
```
As I said, I had to revert to 0.15.13 to make it work again; versions after that report this error:
```
    return BaseDataContext(project_config=dc_cfg)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/core/usage_statistics/usage_statistics.py", line 294, in usage_statistics_wrapped_method
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/base_data_context.py", line 297, in __init__
    project_config=project_config, runtime_environment=runtime_environment
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/ephemeral_data_context.py", line 46, in __init__
    super().__init__(runtime_environment=runtime_environment)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/abstract_data_context.py", line 178, in __init__
    self._init_stores(self.project_config_with_variables_substituted.stores)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/abstract_data_context.py", line 1416, in _init_stores
    self._build_store_from_config(store_name, store_config)
  File "/usr/local/lib/python3.7/site-packages/great_expectations/data_context/data_context/abstract_data_context.py", line 1346, in _build_store_from_config
    "manually_initialize_store_backend_id": self.variables.anonymous_usage_statistics.data_context_id
AttributeError: 'dict' object has no attribute 'data_context_id'
```
Hi @planck-length, I'm Will with the OSS team at Great Expectations. I really appreciate you sharing your configuration; it's been a big help in investigating the issue.
So the cause of the error has to do with some additional checks we have been doing for `DataContext`, particularly around configurations that are now stored as `DataContextVariables`.

What that means is that the `__dict__.update(cfg)` at the end of the setup script prevents some of the `Config` objects, like `AnonymizedUsageStatisticsConfig`, from being instantiated properly, hence the `AttributeError`.
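The failure mode can be reproduced in miniature without Great Expectations at all. In this sketch, `ProjectConfig` and `UsageStatsConfig` are made-up stand-ins for GE's typed config classes (`DataContextConfig`, `AnonymizedUsageStatisticsConfig`): updating `__dict__` with raw YAML data bypasses the constructor that builds typed sub-configs, leaving a plain dict where a typed object is expected.

```python
class UsageStatsConfig:
    """Stand-in for a typed sub-config like AnonymizedUsageStatisticsConfig."""
    def __init__(self, enabled=True, data_context_id="abc-123"):
        self.enabled = enabled
        self.data_context_id = data_context_id


class ProjectConfig:
    """Stand-in for DataContextConfig: the constructor builds typed sub-configs."""
    def __init__(self, anonymous_usage_statistics=None):
        self.anonymous_usage_statistics = UsageStatsConfig(
            **(anonymous_usage_statistics or {}))


cfg = ProjectConfig()
print(cfg.anonymous_usage_statistics.data_context_id)  # abc-123

# Raw dict as loaded from YAML; __dict__.update skips the constructor,
# clobbering the typed object with a plain dict.
raw_yaml = {"anonymous_usage_statistics": {"enabled": True}}
cfg.__dict__.update(raw_yaml)

try:
    cfg.anonymous_usage_statistics.data_context_id
except AttributeError as e:
    print(e)  # 'dict' object has no attribute 'data_context_id'
```

The same pattern explains why building the config through `DataContextConfig(**cfg)` avoids the crash: the constructor gets a chance to turn every nested dict into its typed counterpart.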
Here is an updated `setup_runtime_data_context()` script, where we load the `DataContextConfig` directly from `cfg` and replace the `.datasources` attribute with the one we want (`ds_cfg`). Would you be able to give it a shot and let us know how it goes?
```python
def updated_setup_runtime_data_context(local=False):
    if local:
        return ge.get_context()
    s3 = boto3.client("s3")
    os.environ["AWS_ACCOUNT_ID"] = etl_utils.DbInfo._get_account_id()
    os.environ["AWS_ACCOUNT_REGION"] = etl_utils.DbInfo._get_region()
    great_expectations_yml = s3.get_object(
        Bucket=ge_expectations_yml_s3_bucket, Key=ge_expectations_yml_s3_key)["Body"]
    yml = yaml.YAML()
    cfg = yml.load(great_expectations_yml)
    ds_cfg = DatasourceConfig(
        **cfg.get("datasources").get("my_spark_dataframe"))
    # Load the full config directly, then swap in the typed DatasourceConfig.
    dc_cfg = DataContextConfig(**cfg)
    dc_cfg.datasources["my_spark_dataframe"] = ds_cfg
    return BaseDataContext(project_config=dc_cfg)
```
Hello @Shinnnyshinshin, I just copied your updated code and ran it on the upgraded version, and it didn't fail. I'll do a full check later today. Thank you for your help!
Great to hear, @planck-length. Let us know how the full test goes :)
Hi @planck-length, I wanted to check in on this issue. Did the script work with the full test?
@Shinnnyshinshin hi, sorry for the delay in responding. Yes, the script ran all my suites with no execution failures. Thanks again for your help!
Wonderful :) Glad to help. I'll close this issue now.