great_expectations
Data Docs hosting update on Azure overwrites previous docs validation
Describe the bug Hey,
I have a problem with Azure Data Docs: while one person updates the docs, they get appended on Azure Blob Storage, but when another person then appends docs, Great Expectations overwrites them and clears the previous person's validation results.
To Reproduce Steps to reproduce the behavior:
- The first person runs table validations
- They update the Azure Blob Storage docs
- Azure Blob Storage now contains the first person's docs
- Another team member runs new validations
- They update the Azure Blob Storage docs
- Azure Blob Storage now contains only the second person's docs
Expected behavior Great Expectations should not overwrite the Data Docs on Azure Blob Storage when another team member updates them.
Environment:
Operating System: Linux CentOS (JupyterLab)
Great Expectations Version: 0.15.26
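The symptom above is consistent with each site build publishing only the validations present in that user's local validations store and then replacing the remote site wholesale, rather than merging into what is already there. A minimal stdlib-only sketch of that difference (all names here are hypothetical, purely to illustrate the replace-vs-merge behaviour):

```python
# Hypothetical model of a Data Docs site: a mapping from
# validation-result keys to rendered pages. Each user only has
# their *own* results locally (TupleFilesystemStoreBackend under
# uncommitted/validations/), so rebuild-and-replace loses the rest.

def publish_replace(remote_index, local_results):
    """Overwrite the remote index with only this user's results."""
    return dict(local_results)

def publish_merge(remote_index, local_results):
    """Merge this user's results into whatever is already remote."""
    merged = dict(remote_index)
    merged.update(local_results)
    return merged

user_a = {"orders.table_critical/run_a": "<html>...</html>"}
user_b = {"customers.columns/run_b": "<html>...</html>"}

# Replace semantics: user B's publish wipes out user A's results.
remote = publish_replace({}, user_a)
remote = publish_replace(remote, user_b)
print(sorted(remote))  # only user B's key survives

# Merge semantics: both users' results remain visible.
remote = publish_merge({}, user_a)
remote = publish_merge(remote, user_b)
print(sorted(remote))  # both keys present
```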
Howdy @mateoxc 👋 thank you for raising this with us :bow:
We'll bring it up with the team, and may need to reach out for a bit more information. We'll let you know.
Any updates on this?
Hello,
Could anyone confirm whether this issue is under investigation at the moment? It is a crucial bug to address: otherwise, long-term data validation across a team cannot be captured in Data Docs, which in a way defeats their purpose. The problem was reported a month ago on GitHub, and I've come across corresponding posts by other users on Slack. I would greatly appreciate any feedback on this matter.
I am looking forward to hearing from you.
Kind regards
Just FYI: not sure how it relates to Azure, but as linked above I have the same issue with S3. What solved it for me was upgrading botocore
(I noticed my newer local env was working, while my older Docker env was not).
Broken botocore version: 1.19.63
Working botocore version: 1.27.80
EDIT: OK, it did help, but then it stopped working correctly again. I will dig in a little more, but it seems like reinstalling packages triggered some change in behaviour.
OK, no idea. The same Docker image on Kubernetes always overwrites everything, while locally it creates a new folder every time. The packages are the same, the bucket is the same... No idea what else could differ, to be honest.
We will see, maybe someone will help us. In my opinion it's a GE issue, because the functions that send docs to the cloud are different for Azure and AWS. I looked in the source code and tried to disable the overwrite, but then it refused to write anything to the blob.
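One possible mitigation (untested here; the container name is an assumption, modelled on the Azure backend already used for the docs site) is to point the validations store itself at a shared Azure container, so that every user's site rebuild sees every user's results rather than only the local ones:

```yaml
validations_store:
  class_name: ValidationsStore
  store_backend:
    class_name: TupleAzureBlobStoreBackend
    container: data-fabric-qa-validations   # hypothetical shared container
    connection_string: ${AZURE_STORAGE_WEB_CONNECTION_STRING}
```

With a local `TupleFilesystemStoreBackend` under `uncommitted/validations/`, each machine only ever holds its own run history, so whichever build pushes last determines what the remote site shows.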
Hey @adam-mrozik @psxmc6 @mateoxc ! Thanks for reaching out and for the discussion here. I'll be escalating these issues internally in the morning; any context you can provide around your configs & deployments, similar to what @adam-mrozik has provided (in #6314), will help us better understand what may be causing the behavior here. Thank you for your help and your patience! 🙇
@austiezr great-expectations.yaml:
```yaml
config_version: 3.0

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources:
  data-fabric-tables:
    data_connectors:
      my_runtime_data_connector:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers:
          - default_identifier_name
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      class_name: SparkDFExecutionEngine
      module_name: great_expectations.execution_engine
      force_reuse_spark_context: true
  data-fabric-vardict:
    data_connectors:
      my_runtime_data_connector:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers:
          - default_identifier_name
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      class_name: PandasExecutionEngine
      module_name: great_expectations.execution_engine

config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
  # Stores are configurable places to store things like Expectations, Validations,
  # Data Docs, and more. These are for advanced users only - most users can simply
  # leave this section alone.
  #
  # Three stores are required: expectations, validations, and
  # evaluation_parameters, and must exist with a valid store entry. Additional
  # stores can be configured for uses such as data_docs, etc.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
  evaluation_parameter_store:
    # Evaluation Parameters enable dynamic expectations. Read more here:
    # https://docs.greatexpectations.io/docs/reference/evaluation_parameters/
    class_name: EvaluationParameterStore
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/
  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. They are built for all
  # Datasources from JSON artifacts in the local repo, including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  IP Data Fabric QA:
    class_name: SiteBuilder
    # set to false to hide how-to buttons in Data Docs
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/data_fabric_qa/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  IP Data Fabric QA (Azure Backup):
    class_name: SiteBuilder
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: data-fabric-qa-docs
      connection_string: ${AZURE_STORAGE_WEB_CONNECTION_STRING}
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
```
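For reference, the `${AZURE_STORAGE_WEB_CONNECTION_STRING}` reference in the config above is resolved by Great Expectations at load time from the environment or from `uncommitted/config_variables.yml`, so the secret never lives in the committed YAML. A rough stdlib-only analogue of that substitution (the connection-string value here is a made-up placeholder):

```python
import os
from string import Template

# Made-up placeholder value; in practice this is the real Azure
# Storage connection string, supplied by the environment.
os.environ["AZURE_STORAGE_WEB_CONNECTION_STRING"] = (
    "DefaultEndpointsProtocol=https;AccountName=example;AccountKey=..."
)

raw = "connection_string: ${AZURE_STORAGE_WEB_CONNECTION_STRING}"
resolved = Template(raw).substitute(os.environ)
print(resolved)  # the secret is injected only at runtime
```

If the variable is unset in the deployment (e.g. inside a Docker image on Kubernetes) the Azure site cannot be written correctly, so it is worth confirming it is present in every environment that builds the docs.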
Checkpoint run:
```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from ruamel.yaml import YAML

yaml = YAML()
context = ge.get_context()

batch_request = RuntimeBatchRequest(
    datasource_name="data-fabric-tables",
    data_connector_name="my_runtime_data_connector",
    data_asset_name=data_asset_name,
    runtime_parameters={"batch_data": df},
    batch_identifiers={"default_identifier_name": table_name},
)

my_checkpoint_name = table_name  # This was populated from your CLI command.
yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1
class_name: SimpleCheckpoint
"""
context.add_checkpoint(**yaml.load(yaml_config))

context.run_checkpoint(
    run_name=checkpoint_run_name,
    checkpoint_name=table_name,
    validations=[
        {"expectation_suite_name": f"{table_name}.table_critical",
         "batch_request": batch_request},
        {"expectation_suite_name": f"{table_name}.columns",
         "batch_request": batch_request},
    ],
)
```