great_expectations icon indicating copy to clipboard operation
great_expectations copied to clipboard

Testing YAML config takes significant amount of time when partitioning large table

Open cbuffett opened this issue 2 years ago • 3 comments

Describe the bug When testing a YAML config that defines a table partition, Great Expectations executes a SELECT DISTINCT <partition columns> FROM <table> query. For a large table (nearly half a trillion rows), that takes a significant amount of time (on the order of hours). This also appears to happen when fetching the validator, even if the batch_request specifies a single partition. It appears this is the result of ConfiguredAssetSqlDataConnector.get_batch_definition_list_from_batch_request() refreshing its data reference cache, which attempts to fetch all partitions.

To Reproduce Steps to reproduce the behavior:

  1. Define a datasource YAML config with a tables section that partitions a very large table on multiple columns (e.g., date, hour, project_id).
datasource_yaml = f"""
name: {datasource_name}
class_name: SimpleSqlalchemyDatasource
credentials:
  host: {snowflake_host}
  username: {snowflake_user}
  password: {snowflake_password}
  database: {snowflake_db}
  query:
    schema: {snowflake_schema}
    warehouse: {snowflake_warehouse}
    role: {snowflake_role}
  connect_args:
    authenticator: {snowflake_authenticator}
  drivername: snowflake
      
tables:  # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
  project:
    partitioners:
      project_by_date_and_hour:
        include_schema_name: true
        schema_name: {snowflake_schema}
        data_asset_name_suffix: _project_id
        splitter_method: _split_on_multi_column_values
        splitter_kwargs:
          column_names: [date, hour, project_id]
"""
  1. Validate the YAML config using context.test_yaml_config or attempt to fetch a validator using a batch request that specifies a single partition via batch_filter_parameters
  2. Generated query to populate reference cache will attempt to fetch all partitions, even in the case that a single partition is specified, which can take a significant amount of time for large tables with high cardinality across the partition columns

Expected behavior Options to limit/disable cache refreshing or otherwise improve performance

Environment (please complete the following information):

  • Operating System: Linux
  • Great Expectations Version: 0.15.2
  • DB: Snowflake

cbuffett avatar Apr 26 '22 20:04 cbuffett

Thanks for raising this, @cbuffett - we will review and be in touch.

talagluck avatar Apr 29 '22 19:04 talagluck

Hi @cbuffett - thanks for your patience. I wanted to check, is this still an issue for you? If so, we were planning on figuring out prioritization this week

talagluck avatar Aug 09 '22 15:08 talagluck

@talagluck Yes, this would pose a serious performance impact, particularly around using the validator with batch parameters defined that correspond to a single table partition. The performance impact for testing YAML configurations is secondary as this isn't something that would run inside a production environment.

cbuffett avatar Aug 09 '22 17:08 cbuffett