great_expectations icon indicating copy to clipboard operation
great_expectations copied to clipboard

[FEATURE] Enable only ExactNumericRangeEstimator and QuantilesNumericRangeEstimator in "datetime_columns_rule" of OnboardingDataAssistant

Open alexsherstinsky opened this issue 1 year ago • 2 comments

Scope

Part A:

When the data consists of floating point numbers, then all of the estimators work properly by utilizing Python and Numpy numerical packages. However, when data consists of DateTime/TimeStamp values, only simple operations are supported. Complex arithmetic and statistical operations (mean, standard deviations, etc.) do not work with DateTime/TimeStamp-valued quantities, raising errors if encountered. The fact that the estimators aggregate the floating points (from a full column set to two numbers, min and max), makes it impossible to unambiguously discern the TimeZone of the resulting value range elements. The situation is made even worse if the original data contains DateTime/TimeStamp having multiple TimeZone attributes (datetime.datetime.tzinfo property). However, the metrics used in expectations, and hence part of validation, do preserve the TimeZone information for every metric that supports DateTime/TimeStamp data types. Such metrics include, for example, “column.min" and "column.max". On the other hand, “column.mean" and "column.standard_deviation", along with many other statistical metrics do not support DateTime/TimeStamp column values, raising an error if inadvertently attempted.

Since metrics/expectations that support the DateTime/TimeStamp formats also honor the TimeZone information, it is crucial for the profiler output (e.g., NumericRangeEstimator results) to do so as well for the corresponding metrics/expectations.

The solution is provide DateTime/TimeStamp support only when “exact” or “quantiles” estimators are specified, disallowing these data types for the more complex estimation algorithms (currently, “bootstrap” and “kde” estimators).

Hence, DateTime/TimeStamp data types will be currently supported only by ExactNumericRangeEstimator and QuantilesNumericRangeEstimator, but not by BootstrapNumericRangeEstimator or KdeNumericRangeEstimator, with any future NumericRangeEstimator implementations evaluated and tested to determine whether or not they can handle DateTime/TimeStamp-valued data.

Part B:

A long-standing error in TimeZone handling in ColumnValuesBetween._pandas() is identified and corrected, enabling full TimeZone support. A separate integration test is added for the expect_column_values_to_be_between expectation to demonstrate the enhanced capability.

Please annotate your PR title to describe what the PR does, then give a brief bulleted description of your PR below. PR titles should begin with [BUGFIX], [FEATURE], [DOCS], or [MAINTENANCE]. If a new feature introduces breaking changes for the Great Expectations API or configuration files, please also add [BREAKING]. You can read about the tags in our contributor checklist.

Remarks

Updated OnboardingDataAssistant handling of the way DateTime/TimeStamp-valued data ranges are computed needs to be documented.

Changes proposed in this pull request:

  • JIRA: GREAT-761/GREAT-1274

After submitting your PR, CI checks will run and @cla-bot will check for your CLA signature.

For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.

In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in github issues or slack, so please link to relevant conversations in notes below to help reviewers understand and approve your PR more quickly (e.g. closes #123).

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

  • [x] My code follows the Great Expectations style guide
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] I have added unit tests where applicable and made sure that new and existing tests are passing.
  • [x] I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

alexsherstinsky avatar Sep 22 '22 03:09 alexsherstinsky

Deploy Preview for niobium-lead-7998 ready!

Name Link
Latest commit 1df63bd6596c278598179bef5cf838aad30f6beb
Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/632d15653b33650009a4279f
Deploy Preview https://deploy-preview-6063--niobium-lead-7998.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

netlify[bot] avatar Sep 22 '22 03:09 netlify[bot]

👇 Click on the image for a new way to code review
  • Make big changes easier — review code in small groups of related files

  • Know where to start — see the whole change at a glance

  • Take a code tour — explore the change with an interactive tour

  • Make comments and review — all fully sync’ed with github

    Try it now!

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map Legend

codesee-maps[bot] avatar Sep 22 '22 03:09 codesee-maps[bot]