great_expectations
great_expectations copied to clipboard
Testing YAML config takes significant amount of time when partitioning large table
Describe the bug
When testing a YAML config that defines a table partition, Great Expectations executes a SELECT DISTINCT <partition columns> FROM <table>
query. For a large table (nearly half a trillion rows), that takes a significant amount of time (on the order of hours). This also appears to happen when fetching the validator, even if the batch_request specifies a single partition. It appears this is the result of ConfiguredAssetSqlDataConnector.get_batch_definition_list_from_batch_request()
refreshing its data reference cache, which attempts to fetch all partitions.
To Reproduce Steps to reproduce the behavior:
- Define a datasource YAML config with a
tables
section that partitions a very large table on multiple columns (e.g., date, hour, project_id).
datasource_yaml = f"""
name: {datasource_name}
class_name: SimpleSqlalchemyDatasource
credentials:
host: {snowflake_host}
username: {snowflake_user}
password: {snowflake_password}
database: {snowflake_db}
query:
schema: {snowflake_schema}
warehouse: {snowflake_warehouse}
role: {snowflake_role}
connect_args:
authenticator: {snowflake_authenticator}
drivername: snowflake
tables: # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
project:
partitioners:
project_by_date_and_hour:
include_schema_name: true
schema_name: {snowflake_schema}
data_asset_name_suffix: _project_id
splitter_method: _split_on_multi_column_values
splitter_kwargs:
column_names: [date, hour, project_id]
"""
- Validate the YAML config using
context.test_yaml_config
or attempt to fetch a validator using a batch request that specifies a single partition viabatch_filter_parameters
- Generated query to populate reference cache will attempt to fetch all partitions, even in the case that a single partition is specified, which can take a significant amount of time for large tables with high cardinality across the partition columns
Expected behavior Options to limit/disable cache refreshing or otherwise improve performance
Environment (please complete the following information):
- Operating System: Linux
- Great Expectations Version: 0.15.2
- DB: Snowflake
Thanks for raising this, @cbuffett - we will review and be in touch.
Hi @cbuffett - thanks for your patience. I wanted to check, is this still an issue for you? If so, we were planning on figuring out prioritization this week
@talagluck Yes, this would pose a serious performance impact, particularly around using the validator with batch parameters defined that correspond to a single table partition. The performance impact for testing YAML configurations is secondary as this isn't something that would run inside a production environment.