PandasSchema icon indicating copy to clipboard operation
PandasSchema copied to clipboard

Handling empty string in DateFormatValidation

Open Natalie-Caruana opened this issue 3 years ago • 5 comments

Hi, I am experimenting with the DateFormatValidation validator on the following test example:

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import DateFormatValidation

schema = Schema([Column('Date', [DateFormatValidation('%Y-%m-%d')],allow_empty=True), Column('Datetime', [DateFormatValidation('%Y-%m-%d %H:%M')],allow_empty=True)])

test_data = pd.DataFrame([['','2010-09-10 11:10'],['2009-01-04','']],columns=['Date','Datetime'])
errors = schema.validate(test_data)
for error in errors:
    print(error)

It seems to me that there is a bug with regards to handling empty string '' as I'm not getting back any errors. I do get an error however if I use double whitespace for example. But it looks like the validator considers empty string as NULL. Other than that thanks alot for this package really handy!

Natalie-Caruana avatar Oct 03 '20 10:10 Natalie-Caruana

Okay so the current allow_empty behaviour is this: https://github.com/TMiguelT/PandasSchema/blob/eae711871e350519d6df48547d3000a7b6fedcb3/pandas_schema/validation.py#L86-L92

So basically allow_empty only checks for null in numeric type and categorical types. For the others we check for 0 string length. But in hindsight, I'm not really sure why I did this, because it is possible for a dtype=str Series to have null values, even if you do so via read_csv.

Unfortunately this would be a breaking change so I can't fix it without a major version, but I'll try to work out something more logical for 1.0.0.

In the meantime, you can use some kind of boolean logic to work around this. Maybe DateFormatValidation('%Y-%m-%d') & CustomSeriesValidation(lambda s: s.str.len == 0, 'had an empty string')

multimeric avatar Oct 03 '20 14:10 multimeric

I tried adding the CustomSeriesValidation as shown below:

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import DateFormatValidation,CustomSeriesValidation
schema = Schema([Column('Date', [DateFormatValidation('%Y-%m-%d'),CustomSeriesValidation(lambda s: len(str(s))==0,'has an empty string')],allow_empty=True), Column('Datetime', [DateFormatValidation('%Y-%m-%d %H:%M')],allow_empty=True)])
test_data = pd.DataFrame([['','2010-09-10 11:10'],['2009-01-04','']],columns=['Date','Datetime'])
errors = schema.validate(test_data)
for error in errors:
    print(error)

however i'm getting the following output

{row: 1, column: "Date"}: "2009-01-04" has an empty string

the error does not seem to be pointing to the correct location.

Natalie-Caruana avatar Oct 06 '20 09:10 Natalie-Caruana

Oh, is should be CustomSeriesValidation(lambda s: s.str.len != 0, 'had an empty string') (not equals, rather than equals)

multimeric avatar Oct 06 '20 09:10 multimeric

Replacing "==" with "!=" does not return any errors :confused:

import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import DateFormatValidation,CustomSeriesValidation
schema = Schema([Column('Date', [DateFormatValidation('%Y-%m-%d'),CustomSeriesValidation(lambda s: len(str(s))!=0,'has an empty string')],allow_empty=True), Column('Datetime', [DateFormatValidation('%Y-%m-%d %H:%M'),CustomSeriesValidation(lambda s: len(str(s))!=0,'has an empty string')],allow_empty=True)])
test_data = pd.DataFrame([['','2010-09-10 11:10'],['2009-01-04','']],columns=['Date','Datetime'])
errors = schema.validate(test_data)
for error in errors:
    print(error)

Natalie-Caruana avatar Oct 06 '20 10:10 Natalie-Caruana

Oh right. I think the allow_empty is overriding the empty string behaviour. At this point I think it's easiest to remove allow_empty, and or two validations together. Indicating that this series is valid if it's null or if it's a valid date, but not otherwise.

Also note that in your example you're not using the CustomSeriesValidation correctly, as the argument s is a pandas Series, and not a Python string, so you can't use len in that way (or at least I don't think you can?)

Column(
    'Datetime',
    [
        CustomSeriesValidation(lambda s: s.isnull(), 'is not null') | 
        DateFormatValidation('%Y-%m-%d %H:%M')
    ], allow_empty=False
)

multimeric avatar Oct 06 '20 16:10 multimeric