pyjanitor
[ENH] Introduce a Dataframe validation method?
I'm working on a .validate() method that would validate a DataFrame for certain characteristics.
Most of the examples are taken from this Stack Overflow answer, but there are a few others that I think are important to add (such as regex matching and uniqueness over multiple columns).
We could just add PandasSchema to the project, but it's a bit wordy when creating the schema and it doesn't validate uniqueness across multiple columns (which is a pretty big deal in my specific use cases). PandasSchema mentions this and alludes to doing more than just column-level validation, as in this comment.
My schema looks like this:
schema = {
    ('firstname', 'lastname'): 'unique',
    'login': 'len<=8',
    # raw string, so the regex backslashes are not treated as string escapes
    'phonenumber': r'regex:((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}'
}
and then somewhere in the chained methods one would call
df.validate(schema)
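To make the three rule types concrete, here is a minimal sketch of how such a schema might be checked. It is not the implementation discussed in this issue: the function name `find_violations` is hypothetical, it works on a plain mapping of column name to list of values (for a pandas DataFrame, `df.to_dict("list")` produces that shape), and it assumes `regex:` means the whole value must match.

```python
import re
from collections import Counter


def find_violations(columns, schema):
    """Return {schema key: [row positions that break the rule]}.

    `columns` maps column name -> list of values, all of equal length.
    Hypothetical helper; rule semantics are assumptions, not a fixed spec.
    """
    violations = {}
    for key, rule in schema.items():
        # A tuple key means the rule spans several columns.
        names = key if isinstance(key, tuple) else (key,)
        # One tuple of values per row, restricted to the named columns.
        rows = list(zip(*(columns[name] for name in names)))

        if rule == "unique":
            counts = Counter(rows)
            bad = [i for i, row in enumerate(rows) if counts[row] > 1]
        elif rule.startswith("len<="):
            limit = int(rule[len("len<="):])
            bad = [i for i, row in enumerate(rows)
                   if any(len(str(v)) > limit for v in row)]
        elif rule.startswith("regex:"):
            # Assumption: the entire value must match the pattern.
            pattern = re.compile(rule[len("regex:"):])
            bad = [i for i, row in enumerate(rows)
                   if any(not pattern.fullmatch(str(v)) for v in row)]
        else:
            raise ValueError(f"unknown rule: {rule!r}")

        if bad:
            violations[key] = bad
    return violations


# Made-up sample data, for illustration only.
columns = {
    "firstname": ["Ann", "Bob", "Ann"],
    "lastname": ["Lee", "Ray", "Lee"],
    "login": ["alee", "bobray123", "alee2"],
    "phonenumber": ["(555) 123-4567", "555-1234", "12345"],
}
schema = {
    ("firstname", "lastname"): "unique",
    "login": "len<=8",
    "phonenumber": r"regex:((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}",
}
violations = find_violations(columns, schema)
```

Reporting row positions per rule keeps the multi-column uniqueness check and the single-column checks in one result, so a caller can decide afterwards whether to print the offending rows or stop.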
What's nice about PandasSchema is that it outputs the rows that don't match the validation; mine doesn't at the moment. I'm thinking about throwing exceptions, though, so that program execution stops if the DataFrame doesn't pass validation. One could also have different schemas and pepper them throughout the method chain to ensure the DataFrame is transforming correctly. My main concern is ensuring that a DataFrame is 'correct' enough to save to a database table with a set schema.
The main thing I'm currently struggling with is whether I should throw exceptions, and if not, what should the validation do? I think outputting the values that don't match the schema is reasonable, but I guess I don't see how that helps much in the method chain.
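One possible way to get both behaviours, sketched under assumptions: raise an exception that carries the failing rows, and return the data unchanged on success so the call can sit inside a method chain. The names `ValidationError`, `validate` and `check` are hypothetical, and plain lists stand in for a DataFrame.

```python
class ValidationError(ValueError):
    """Raised when rows fail validation; the offending rows ride along."""

    def __init__(self, failures):
        self.failures = failures
        super().__init__(f"{len(failures)} row(s) failed validation: {failures}")


def validate(rows, check):
    """Return `rows` unchanged if every row passes `check`, else raise.

    Returning the input on success is what lets the call be chained.
    """
    failures = [row for row in rows if not check(row)]
    if failures:
        raise ValidationError(failures)
    return rows


# Made-up logins; the second one breaks the len<=8 rule.
logins = ["alee", "bobray123", "cdoe"]
try:
    validate(logins, lambda s: len(s) <= 8)
except ValidationError as err:
    caught = err.failures  # the rows that did not match
```

With this shape, execution stops on bad data by default, and anyone who wants the PandasSchema-style report can catch the exception and inspect `failures`.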
@szuckerman I just realized I let this ball drop.
We now have a data dictionary thingy, and I think you could add another custom accessor that allows for validation of a dataframe. Or it could just be a function, just as you provided. What do you think?
Btw, I'm so happy you brought up PandasSchema; the code shown in their examples is really awesome!