PandasSchema
PandasSchema copied to clipboard
Handling empty string in DateFormatValidation
Hi, I am experimenting with the DateFormatValidation validator on the following test example:
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import DateFormatValidation
schema = Schema([Column('Date', [DateFormatValidation('%Y-%m-%d')],allow_empty=True), Column('Datetime', [DateFormatValidation('%Y-%m-%d %H:%M')],allow_empty=True)])
test_data = pd.DataFrame([['','2010-09-10 11:10'],['2009-01-04','']],columns=['Date','Datetime'])
errors = schema.validate(test_data)
for error in errors:
print(error)
It seems to me that there is a bug with regards to handling empty string '' as I'm not getting back any errors. I do get an error however if I use double whitespace for example. But it looks like the validator considers empty string as NULL. Other than that thanks alot for this package really handy!
Okay so the current allow_empty
behaviour is this:
https://github.com/TMiguelT/PandasSchema/blob/eae711871e350519d6df48547d3000a7b6fedcb3/pandas_schema/validation.py#L86-L92
So basically allow_empty
only checks for null in numeric type and categorical types. For the others we check for 0 string length. But in hindsight, I'm not really sure why I did this, because it is possible for a dtype=str
Series to have null values, even if you do so via read_csv
.
Unfortunately this would be a breaking change so I can't fix it without a major version, but I'll try to work out something more logical for 1.0.0.
In the meantime, you can use some kind of boolean logic to work around this. Maybe DateFormatValidation('%Y-%m-%d') & CustomSeriesValidation(lambda s: s.str.len == 0, 'had an empty string')
I tried adding the CustomSeriesValidation as shown below:
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import DateFormatValidation,CustomSeriesValidation
schema = Schema([Column('Date', [DateFormatValidation('%Y-%m-%d'),CustomSeriesValidation(lambda s: len(str(s))==0,'has an empty string')],allow_empty=True), Column('Datetime', [DateFormatValidation('%Y-%m-%d %H:%M')],allow_empty=True)])
test_data = pd.DataFrame([['','2010-09-10 11:10'],['2009-01-04','']],columns=['Date','Datetime'])
errors = schema.validate(test_data)
for error in errors:
print(error)
however i'm getting the following output
{row: 1, column: "Date"}: "2009-01-04" has an empty string
the error does not seem to be pointing to the correct location.
Oh, is should be CustomSeriesValidation(lambda s: s.str.len != 0, 'had an empty string')
(not equals, rather than equals)
Replacing "==" with "!=" does not return any errors :confused:
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import DateFormatValidation,CustomSeriesValidation
schema = Schema([Column('Date', [DateFormatValidation('%Y-%m-%d'),CustomSeriesValidation(lambda s: len(str(s))!=0,'has an empty string')],allow_empty=True), Column('Datetime', [DateFormatValidation('%Y-%m-%d %H:%M'),CustomSeriesValidation(lambda s: len(str(s))!=0,'has an empty string')],allow_empty=True)])
test_data = pd.DataFrame([['','2010-09-10 11:10'],['2009-01-04','']],columns=['Date','Datetime'])
errors = schema.validate(test_data)
for error in errors:
print(error)
Oh right. I think the allow_empty
is overriding the empty string behaviour. At this point I think it's easiest to remove allow_empty
, and or
two validations together. Indicating that this series is valid if it's null or if it's a valid date, but not otherwise.
Also note that in your example you're not using the CustomSeriesValidation
correctly, as the argument s
is a pandas Series, and not a Python string, so you can't use len
in that way (or at least I don't think you can?)
Column(
'Datetime',
[
CustomSeriesValidation(lambda s: s.isnull(), 'is not null') |
DateFormatValidation('%Y-%m-%d %H:%M')
], allow_empty=False
)