ISO 8601 format check fails on valid dates.

Open Ingmarvdg opened this issue 1 year ago • 13 comments

When checking if a datetime string is ISO 8601 compliant, some valid datetimes fail.

Dates in the 10th month.

The date 2020-10-11 fails, while the dates 2020-09-11 and 2020-11-12 pass. The cause appears to be this section of the regular expression: ?((0[0-9]|1[12]), which should be ?((0[0-9]|1[0-2]).
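A minimal sketch of the month problem, using only the two alternations quoted above. The names OLD_MONTH, NEW_MONTH and accepts are illustrative and do not come from soda-core.

```python
import re

# Month alternations copied from the text above; the names are illustrative only.
OLD_MONTH = re.compile(r"(0[0-9]|1[12])")
NEW_MONTH = re.compile(r"(0[0-9]|1[0-2])")


def accepts(pattern, month):
    """Return True when the whole two-digit month string matches the pattern."""
    return pattern.fullmatch(month) is not None


# Compare both alternations on the months around October.
for month in ("09", "10", "11", "12"):
    print(month, accepts(OLD_MONTH, month), accepts(NEW_MONTH, month))
```

Only "10" should differ between the two columns: 1[12] has no branch for a second digit of 0.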

Dates before 1900 or after 2099

The dates 2100-01-01 and 1899-01-01 fail because of this section of the regular expression, which only allows years from 1900 to 2099: *(19|20)[[:digit:]][[:digit:]].
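The year restriction can be sketched the same way. Python's re module has no POSIX classes, so [[:digit:]] is written as \d here. ANY_FOUR_DIGIT_YEAR is a hypothetical relaxed form for comparison and is not necessarily what a fix in soda-core would use.

```python
import re

# Year prefix from the text above, with [[:digit:]] written as \d for Python's re.
RESTRICTED_YEAR = re.compile(r"(19|20)\d\d")
# Hypothetical relaxed form (assumption): any four digits.
ANY_FOUR_DIGIT_YEAR = re.compile(r"\d{4}")


def year_ok(pattern, date_string):
    """Check only the first four characters of a YYYY-MM-DD string."""
    return pattern.fullmatch(date_string[:4]) is not None


for date_string in ("1899-01-01", "1900-01-01", "2099-12-31", "2100-01-01"):
    print(
        date_string,
        year_ok(RESTRICTED_YEAR, date_string),
        year_ok(ANY_FOUR_DIGIT_YEAR, date_string),
    )
```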

Ingmarvdg avatar Jun 17 '24 09:06 Ingmarvdg

SAS-3692

tools-soda avatar Jun 17 '24 09:06 tools-soda

The same goes for any valid format value that uses the year or year4 regexes, for example date inverse.

pholser avatar Jun 17 '24 20:06 pholser

Hi, thanks for reporting. Could you verify whether this is fixed by https://github.com/sodadata/soda-core/pull/2128?

m1n0 avatar Jul 12 '24 18:07 m1n0

@Ingmarvdg would you be willing to try out the above fix for your case?

pholser avatar Jul 12 '24 18:07 pholser

@m1n0 @Ingmarvdg I can confirm that #2128 corrects the problem. However, I also noticed that the ISO 8601 date format should accept 24-hour times, and currently it doesn't. I am opening another PR to correct that.
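For illustration, the hour group in the query logged later in this thread is (0[0-9]|1[012]), which stops at 12. The sketch below compares it with a hypothetical 24-hour alternation. HOUR_24 is an assumption made for this example, not the pattern from any PR.

```python
import re

# Hour group as it appears in the logged query in this thread.
HOUR_12 = re.compile(r"(0[0-9]|1[012])")
# Hypothetical 24-hour alternation (assumption, not taken from soda-core).
HOUR_24 = re.compile(r"([01][0-9]|2[0-3])")


def hour_ok(pattern, hour):
    """Return True when the whole two-digit hour string matches the pattern."""
    return pattern.fullmatch(hour) is not None


for hour in ("00", "12", "13", "23", "24"):
    print(hour, hour_ok(HOUR_12, hour), hour_ok(HOUR_24, hour))
```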

pholser avatar Jul 15 '24 16:07 pholser

See also #2133

pholser avatar Jul 15 '24 16:07 pholser

Hi @pholser, I am using the spark-df version of soda-core, and the problem does not appear to be fixed there yet:

from soda.scan import Scan

scan = Scan()
scan.set_scan_definition_name("tmp")
scan.add_spark_session(spark, data_source_name="tmp")
scan.set_data_source_name("y")

scan.add_sodacl_yaml_str(
    """
checks for tmp:
    - invalid_count(datetime_string) = 0:
        valid format: date iso 8601
"""
)

df = spark.createDataFrame(
    [("1623-10-11T10:10:10.0000+01:00",)], schema=["datetime_string"] # This should pass
)
df.createOrReplaceTempView("tmp")

result_code = scan.execute()
results = scan.get_logs_text()
print(results)

Worse yet, the scan itself errors out during execution instead of reporting a failed check:

[19:05:02] Query execution error in y.a.failed_rows[invalid_count]: 
SELECT * FROM y 
 WHERE NOT (a IS NULL) AND NOT (a rlike('^ *(19|20)\\d\\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\\d(-?[1-7])?|[0-3]\\d\\d)([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\\d+)?)?)?([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$')) 
 LIMIT 100

[19:05:02] Error occurred while executing scan.
  | 'DataFrame' object has no attribute 'offset'
INFO   | Soda Core 3.3.10
INFO   | Using DefaultSampler
ERROR  | Query execution error in y.a.failed_rows[invalid_count]: 
SELECT * FROM y 
 WHERE NOT (a IS NULL) AND NOT (a rlike('^ *(19|20)\\d\\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\\d(-?[1-7])?|[0-3]\\d\\d)([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\\d+)?)?)?([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$')) 
 LIMIT 100 | 
ERROR  | Error occurred while executing scan. | 'DataFrame' object has no attribute 'offset'

Ingmarvdg avatar Jul 15 '24 19:07 Ingmarvdg

@Ingmarvdg OK, I didn't know about the spark-df version of soda-core. Has it incorporated the above change? I'm satisfied that the change in soda-core itself improves the ISO 8601 date check.

pholser avatar Jul 15 '24 21:07 pholser

The regex being fed to the query still appears to be the old incorrect one. I don't believe you've incorporated the soda-core update above into your setup.
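One way to check this locally is to run the pattern from the logged query against a few values outside Spark. This is a sketch that assumes Python's re module treats this particular pattern the same way Spark's rlike does. The SQL-escaped \\d is written as \d, and the helper name logged_pattern_accepts is made up for this example.

```python
import re

# Pattern copied from the logged query, with the SQL-escaped \\d written as \d.
LOGGED_PATTERN = re.compile(
    r"^ *(19|20)\d\d-?((0[0-9]|1[12])-?([012][0-9]|3[01])|W[0-5]\d(-?[1-7])?|[0-3]\d\d)"
    r"([ T](0[0-9]|1[012])(:?[0-5][0-9](:?[0-5][0-9]([.,]\d+)?)?)?"
    r"([+-](0[0-9]|1[012]):?[0-5][0-9]|Z)?)? *$"
)


def logged_pattern_accepts(value):
    """Return True when the logged pattern matches the value."""
    return LOGGED_PATTERN.search(value) is not None


samples = [
    "1623-10-11T10:10:10.0000+01:00",  # value from the reproduction script
    "2020-10-11",                      # month 10, from the original report
    "2020-09-11",                      # reported as passing
    "2100-01-01",                      # year outside 1900-2099
]
for value in samples:
    print(value, logged_pattern_accepts(value))
```

If the installed package really contained the fix, the generated query would be expected to carry a different pattern from the one tested here.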

pholser avatar Jul 16 '24 15:07 pholser

@Ingmarvdg what version are you using? I doubt the fix is in a released version yet.

pholser avatar Jul 16 '24 15:07 pholser

I pulled the most recent version of the main branch, then ran 'pip install .' from the spark-df folder. That should include your changes, right?

Ingmarvdg avatar Jul 16 '24 16:07 Ingmarvdg

Perhaps I changed something that didn't affect your bug.

pholser avatar Jul 16 '24 18:07 pholser

Try also https://github.com/sodadata/soda-core/pull/2133

I added a format test for "1623-10-11T10:10:10.0000+01:00", and it seems to pass when the core tests are run for spark-df: https://github.com/sodadata/soda-core/actions/runs/9962160324/job/27525299481

pholser avatar Jul 16 '24 18:07 pholser