
Data Docs hosting update on Azure overwrites previous docs validation

Open mateoxc opened this issue 1 year ago • 2 comments

Describe the bug Hey,

I have a problem with Azure Data Docs. When one person updates the docs, they are added to Azure Blob Storage as expected. But when another person then updates the docs, Great Expectations overwrites them and clears the first person's validation results.

To Reproduce Steps to reproduce the behavior:

  1. The first person validates some tables.
  2. They update the docs on Azure Blob Storage.
  3. Azure Blob Storage contains the first person's docs.
  4. Another team member runs new validations.
  5. They update the docs on Azure Blob Storage.
  6. Azure Blob Storage now contains only the second person's docs.

Expected behavior Great Expectations should not overwrite the Data Docs on Azure Blob Storage when another team member updates them.
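
To make the difference between the observed and the expected behavior concrete, here is a toy model in plain Python. It does not use Great Expectations or Azure: a dict stands in for the blob container, and `publish_overwrite` and `publish_merge` are names made up for this illustration, not GE functions.

```python
# Toy model only: a dict stands in for the Azure blob container.
# publish_overwrite / publish_merge are hypothetical names, not GE functions.


def publish_overwrite(container: dict, docs: dict) -> dict:
    """Observed behavior: the upload replaces everything already in the container."""
    container.clear()
    container.update(docs)
    return container


def publish_merge(container: dict, docs: dict) -> dict:
    """Expected behavior: existing pages stay and new pages are added."""
    container.update(docs)
    return container


first_person = {"validations/table_a/run_1.html": "<html>A</html>"}
second_person = {"validations/table_b/run_2.html": "<html>B</html>"}

# Two team members publish one after the other into an initially empty container.
observed = publish_overwrite(publish_overwrite({}, first_person), second_person)
expected = publish_merge(publish_merge({}, first_person), second_person)

print(sorted(observed))  # only the second person's page is left
print(sorted(expected))  # both pages are present
```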

Environment:

Operating System: Linux CentOS, JupyterLab
Great Expectations Version: 0.15.26

mateoxc avatar Oct 06 '22 11:10 mateoxc

Howdy @mateoxc 👋 thank you for raising this with us 🙇

We'll bring it up with the team, and may need to reach out for a bit more information. We'll let you know.

AFineDayFor avatar Oct 12 '22 15:10 AFineDayFor

Any updates on this?

psxmc6 avatar Oct 18 '22 09:10 psxmc6

Hello,

Could anyone say whether this issue is currently being investigated? It is a crucial bug to address, because otherwise long-term data validation across the team cannot be captured in Data Docs, which in a way defeats their purpose. The problem was reported on GitHub a month ago, and I've come across corresponding posts from other users on Slack. I would greatly appreciate any feedback on this matter.

I am looking forward to hearing from you.

Kind regards

psxmc6 avatar Nov 06 '22 00:11 psxmc6

Just FYI. Not sure how it relates to Azure, but as linked above I have the same issue with S3. What solved it for me was upgrading botocore (I noticed that my newer local env was working, while the older Docker env was not).

Broken botocore version: 1.19.63
Working botocore version: 1.27.80

EDIT: OK, it did help, but then it stopped working correctly again. I will dig in a little more, but it seems that reinstalling the packages triggered some change in behaviour.

adam-mrozik avatar Nov 07 '22 16:11 adam-mrozik

OK, no idea. The same Docker image on Kubernetes always overwrites everything, while locally it creates a new folder every time. The packages are the same, the bucket is the same... No idea what else could differ, to be honest.
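
Since the same image behaves differently on Kubernetes and locally, one way to double-check that the packages really match is to dump the installed versions in both environments and diff them. Below is a small sketch using only the standard library. The package list is an assumption about what is relevant here, and `installed_versions` and `diff_versions` are helper names made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version, or None if not installed} for this interpreter."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def diff_versions(env_a, env_b):
    """Return {package: (version_a, version_b)} for every package that differs."""
    names = sorted(set(env_a) | set(env_b))
    return {
        name: (env_a.get(name), env_b.get(name))
        for name in names
        if env_a.get(name) != env_b.get(name)
    }


# Assumed list of packages worth comparing; extend as needed.
PACKAGES = ["great_expectations", "botocore", "boto3", "azure-storage-blob"]

# Run this in each environment and compare the two printed dicts,
# or feed both dicts to diff_versions().
print(installed_versions(PACKAGES))
```

Running it in both places and passing the two results to `diff_versions` would show at a glance whether something like the botocore difference mentioned above is still present.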

adam-mrozik avatar Nov 07 '22 17:11 adam-mrozik

We'll see; maybe someone will help us. In my opinion it's a GE issue, because the functions that send docs to the cloud are different for Azure and AWS. I looked at the source code and tried to disable overwriting, but then it refused to write anything to Blob Storage.

mateoxc avatar Nov 07 '22 17:11 mateoxc

Hey @adam-mrozik @psxmc6 @mateoxc ! Thanks for reaching out and for the discussion here. I'll be escalating these issues internally in the morning; any context you can provide around your configs & deployments, similar to what @adam-mrozik has provided (in #6314), will help us to better understand what may be causing the behavior here. Thank you for your help and your patience! 🙇

austiezr avatar Nov 08 '22 22:11 austiezr

@austiezr great-expectations.yaml:

config_version: 3.0

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources:
  data-fabric-tables:
    data_connectors:
      my_runtime_data_connector:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers:
          - default_identifier_name
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      class_name: SparkDFExecutionEngine
      module_name: great_expectations.execution_engine
      force_reuse_spark_context: true
  data-fabric-vardict:
    data_connectors:
      my_runtime_data_connector:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers:
          - default_identifier_name
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
      class_name: PandasExecutionEngine
      module_name: great_expectations.execution_engine
config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
#
# Three stores are required: expectations, validations, and
# evaluation_parameters, and must exist with a valid store entry. Additional
# stores can be configured for uses such as data_docs, etc.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  evaluation_parameter_store:
    # Evaluation Parameters enable dynamic expectations. Read more here:
    # https://docs.greatexpectations.io/docs/reference/evaluation_parameters/
    class_name: EvaluationParameterStore

  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. They are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  IP Data Fabric QA:
    class_name: SiteBuilder
    # set to false to hide how-to buttons in Data Docs
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/data_fabric_qa/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  IP Data Fabric QA (Azure Backup):
    class_name: SiteBuilder
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: data-fabric-qa-docs
      connection_string: ${AZURE_STORAGE_WEB_CONNECTION_STRING}
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

Checkpoint run:

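import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest
from ruamel.yaml import YAML

# df, table_name, data_asset_name and checkpoint_run_name are assumed to be
# defined earlier in the notebook.
yaml = YAML()
context = ge.get_context()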

batch_request = RuntimeBatchRequest(
    datasource_name="data-fabric-tables",
    data_connector_name="my_runtime_data_connector",
    data_asset_name=data_asset_name,  
    runtime_parameters={"batch_data": df},  
    batch_identifiers={"default_identifier_name": table_name},
)

my_checkpoint_name = table_name # This was populated from your CLI command.

yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1
class_name: SimpleCheckpoint
"""
context.add_checkpoint(**yaml.load(yaml_config))

context.run_checkpoint(
    run_name=checkpoint_run_name,
    checkpoint_name=table_name,
    validations=[
        {"expectation_suite_name": f"{table_name}.table_critical",
         "batch_request": batch_request},
        {"expectation_suite_name": f"{table_name}.columns",
         "batch_request": batch_request}
    ],
)
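
To capture evidence of the overwrite, the blob listing of the container can be compared before and after a run. This is only a sketch, assuming the `azure-storage-blob` package is installed and that the container name and environment variable are the ones from the config above. `list_blob_names` and `diff_listings` are hypothetical helper names, not part of Great Expectations.

```python
def list_blob_names(connection_string: str, container: str) -> set:
    """Return the names of all blobs currently in the container."""
    # Imported lazily so that diff_listings() works without the Azure SDK installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    client = service.get_container_client(container)
    return {blob.name for blob in client.list_blobs()}


def diff_listings(before: set, after: set) -> dict:
    """Summarise which blobs disappeared, appeared, or survived a docs update."""
    return {
        "removed": sorted(before - after),
        "added": sorted(after - before),
        "kept": sorted(before & after),
    }


# Usage sketch (needs real credentials, so it is left commented out):
# import os
# conn = os.environ["AZURE_STORAGE_WEB_CONNECTION_STRING"]
# before = list_blob_names(conn, "data-fabric-qa-docs")
# ... run the checkpoint / update the docs here ...
# after = list_blob_names(conn, "data-fabric-qa-docs")
# print(diff_listings(before, after))
```

If the second person's run shows the first person's validation pages under "removed", that confirms the docs are being replaced rather than extended.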

mateoxc avatar Nov 09 '22 10:11 mateoxc