
Checking date ranges since current method will be deprecated

Open · ejohnson-amerilife opened this issue • 6 comments

Is your feature request related to a problem? Please describe. I would like to check that a datetime column is within a range of dates, since it looks like this functionality will be deprecated in a future release (see below).

Describe the solution you'd like I would like to have an expectation, or an adaptation of an existing expectation, that allows me to check that a datetime column is within a range of dates.

Describe alternatives you've considered Currently I am using the expectation expect_column_values_to_be_between with the parameter parse_strings_as_datetimes=True; however, this will be deprecated soon, as indicated by the warning message: The parameter "parse_strings_as_datetimes" is no longer supported and will be deprecated in a future release. Please update code accordingly.
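
For illustration, the usage described above looks roughly like the sketch below. The column name and bounds are hypothetical, and df_ge stands for a pandas DataFrame wrapped as a Great Expectations dataset:

df_ge.expect_column_values_to_be_between(
    column="date_col",  # hypothetical datetime column name
    min_value="2019-01-02",
    max_value="2021-12-30",
    parse_strings_as_datetimes=True,  # triggers the deprecation warning quoted above
)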

ejohnson-amerilife · Apr 12 '22 14:04

Hi @ejohnson-amerilife - thanks for raising this!

The functionality to be deprecated is not the checking of datetime columns, but rather the checking of datetimes that are formatted as strings. You can read more about this decision here.

If you are interested in continuing support for this, please consider adding this functionality as a custom Expectation! The main idea is just that we want expect_column_values_to_be_between to do exactly what it says without doing other transformations along the way. As a reminder, you can also transform the data yourself prior to running Great Expectations.
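
For example, a minimal sketch of that up-front transformation with pandas (the DataFrame and column names here are assumptions, not taken from the original report):

import pandas as pd

# Cast the string-typed column to datetime before handing the DataFrame to
# Great Expectations, so the expectation compares like with like and no
# parsing has to happen inside it ("date_col" is only an example name).
df["date_col"] = pd.to_datetime(df["date_col"])

Whether the subsequent expectation call then accepts datetime bounds cleanly depends on the Great Expectations version, as the rest of this thread shows.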

Given this, I'm going to close this issue for now, but please feel free to re-open if you have additional questions or feedback.

talagluck · Apr 13 '22 18:04

Thank you for your response @talagluck, and I totally agree with the decision not to have GE transform the data; however, that is not what is happening in my case. From the docs, the parse_strings_as_datetimes parameter is defined as:

parse_strings_as_datetimes (boolean or None) : If True, parse min_value, max_value, and all non-null column values to datetimes before making comparisons.

The key piece of information is that the expectation's min_value and max_value must be entered as strings; therefore, the parse_strings_as_datetimes parameter must be set to True even when the column is already a datetime type.

If this is still not clear, I will be happy to post an example.

ejohnson-amerilife · Apr 13 '22 18:04

Hi @ejohnson-amerilife - to confirm, why must the min_value and max_value be entered as strings?

An example would be really helpful - thank you!

talagluck · Apr 13 '22 19:04

I have a datetime64[ns] column in a pandas DataFrame; let's call the column "date_col". When I set parse_strings_as_datetimes=True, the test works as expected, i.e. no exceptions are raised. However, when I set parse_strings_as_datetimes=False, as shown below:

df_ge.expect_column_values_to_be_between(
    column="date_col",
    min_value="2019-01-02",
    max_value="2021-12-30",
    parse_strings_as_datetimes=False
)

The validation result shows the following exception, claiming "Column values, min_value, and max_value must either be None or of the same type."

The full traceback message is:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/great_expectations/execution_engine/execution_engine.py", line 390, in resolve_metrics
    resolved_metrics[metric_to_resolve.id] = metric_fn(
  File "/usr/local/lib/python3.8/site-packages/great_expectations/expectations/metrics/metric_provider.py", line 55, in inner_func
    return metric_fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/great_expectations/expectations/metrics/map_metric_provider.py", line 333, in inner_func
    meets_expectation_series = metric_fn(
  File "/usr/local/lib/python3.8/site-packages/great_expectations/expectations/metrics/column_map_metrics/column_values_between.py", line 158, in _pandas
    return temp_column.map(is_between)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/series.py", line 4161, in map
    new_values = super()._map_values(arg, na_action=na_action)
  File "/usr/local/lib/python3.8/site-packages/pandas/core/base.py", line 870, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/lib.pyx", line 2859, in pandas._libs.lib.map_infer
  File "/usr/local/lib/python3.8/site-packages/great_expectations/expectations/metrics/column_map_metrics/column_values_between.py", line 100, in is_between
    raise TypeError(
TypeError: Column values, min_value, and max_value must either be None or of the same type.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

Also, if I instead specify min_value and max_value as datetimes as follows:

from datetime import datetime

df_ge.expect_column_values_to_be_between(
    column="date_col",
    min_value=datetime.fromisoformat("2019-01-02 00:00:00"),
    max_value=datetime.fromisoformat("2021-12-30 00:00:00"),
    parse_strings_as_datetimes=False
)

I still get the same exception as above.

ejohnson-amerilife · Apr 13 '22 20:04

When I try the same with a date type column using pyspark,

[{
    "expectation_type": "expect_column_min_to_be_between",
    "kwargs": {
        "column": "started",
        "min_value": "2020-12-13"
    },
    "parse_strings_as_datetimes": True
}]

I get the below error response:

"'>=' not supported between instances of 'datetime.datetime' and 'str'"

akhilnambiar29 · May 04 '22 06:05

Hi @ejohnson-amerilife and @aqeelsmith - thank you for your patience! We made some changes in this area last week. Can you please confirm whether this is still an issue for you? Thank you!

talagluck · Aug 10 '22 08:08

Hi Great Expectations team! This issue is still happening when using pyspark.

mollysrour · Feb 14 '24 14:02

Quoting @akhilnambiar29's earlier comment:

When I try the same with a date type column using pyspark,

[{
    "expectation_type": "expect_column_min_to_be_between",
    "kwargs": {
        "column": "started",
        "min_value": "2020-12-13"
    },
    "parse_strings_as_datetimes": True
}]

I get the below error response:

"'>=' not supported between instances of 'datetime.datetime' and 'str'"

The same happens when using a pandas datasource with a parquet asset:

TypeError: '>=' not supported between instances of 'datetime.date' and 'str'

jmilagroso · Mar 24 '24 09:03