[ENH] Introduce a DataFrame validation method?

Open szuckerman opened this issue 5 years ago • 1 comments

I'm working on a .validate() method that would validate a DataFrame for certain characteristics.

Most of the examples are taken from this Stack Overflow answer, but there are a few others that I think are important to add (such as regex matching and uniqueness over multiple columns).

We could just add PandasSchema to the project, but it's a bit wordy when creating the schema, and it doesn't validate uniqueness across multiple columns (which is a pretty big deal in my specific use cases). PandasSchema mentions this and alludes to doing more than just column-level validation, such as in this comment.

My schema looks like this:

schema = {
    ('firstname', 'lastname'): 'unique',
    'login': 'len<=8',
    # Raw string so the regex backslashes are not read as string escapes.
    'phonenumber': r'regex:((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}',
}

and then somewhere in the chained methods one would call

df.validate(schema)
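
For illustration only, here is a minimal sketch of how a schema like the one above could be interpreted. It is not the implementation being worked on: the function name `validate`, the parsing of the rule strings, and the choice to raise `ValueError` are all assumptions made for this example, and it is called through `DataFrame.pipe` rather than as a registered method.

```python
import re

import pandas as pd


def validate(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
    """Hypothetical sketch: raise ValueError on the first failed rule, else return df."""
    for columns, rule in schema.items():
        if rule == "unique":
            # A tuple key means uniqueness over several columns taken together.
            subset = list(columns) if isinstance(columns, tuple) else [columns]
            failed = df.duplicated(subset=subset, keep=False)
        elif rule.startswith("len<="):
            max_len = int(rule[len("len<="):])
            failed = df[columns].astype(str).str.len() > max_len
        elif rule.startswith("regex:"):
            pattern = rule[len("regex:"):]
            # The whole value must match the pattern, not just a prefix.
            failed = df[columns].astype(str).map(
                lambda value: re.fullmatch(pattern, value) is None
            )
        else:
            raise ValueError(f"Unknown rule: {rule!r}")
        if failed.any():
            raise ValueError(
                f"{columns!r} failed {rule!r} in {int(failed.sum())} row(s)"
            )
    # Returning the DataFrame unchanged keeps the method chain going.
    return df


schema = {
    ("firstname", "lastname"): "unique",
    "login": "len<=8",
    "phonenumber": r"regex:((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}",
}

people = pd.DataFrame(
    {
        "firstname": ["Ann", "Bob"],
        "lastname": ["Lee", "Lee"],
        "login": ["alee", "blee"],
        "phonenumber": ["(555) 123-4567", "555-123-4567"],
    }
)

checked = people.pipe(validate, schema)
```

Because the function returns the DataFrame it was given, the same call could be repeated with different schemas at several points in a chain.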

What's nice about PandasSchema is that it outputs the rows that don't match the validation; mine doesn't at the moment. I'm thinking about throwing exceptions, though, so that program execution stops if the DataFrame doesn't pass validation. One could also have different schemas and pepper them throughout the method chain to ensure the DataFrame is transforming correctly. My main concern is ensuring that a DataFrame is 'correct' enough to save to a database table with a set schema.

The main thing I'm currently struggling with is whether I should throw exceptions, and if not, what the validation should do instead. I think outputting the values that don't match the schema is reasonable, but I don't see how that helps much in a method chain.
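
One possible middle ground, sketched here purely as an assumption and not as a decided design, is to raise an exception that carries the offending rows. The chain still stops, but the caller can inspect what failed. The names `ValidationError` and `require_unique` are hypothetical.

```python
import pandas as pd


class ValidationError(ValueError):
    """Hypothetical exception that keeps the rows that failed validation."""

    def __init__(self, message: str, failures: pd.DataFrame):
        super().__init__(message)
        self.failures = failures


def require_unique(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Raise ValidationError if `columns`, taken together, contain duplicates."""
    subset = list(columns)
    # keep=False marks every member of a duplicated group, not just the repeats.
    dupes = df[df.duplicated(subset=subset, keep=False)]
    if not dupes.empty:
        raise ValidationError(
            f"{len(dupes)} row(s) are not unique over {subset}", dupes
        )
    return df


people = pd.DataFrame(
    {"firstname": ["Ann", "Ann", "Bob"], "lastname": ["Lee", "Lee", "Ray"]}
)

try:
    people.pipe(require_unique, ("firstname", "lastname"))
except ValidationError as err:
    failing_rows = err.failures  # the two "Ann Lee" rows
```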

szuckerman avatar Apr 17 '19 02:04 szuckerman

@szuckerman I just realized I let this ball drop.

We now have a data dictionary thingy, and I think you could add another custom accessor that allows for validation of a dataframe. Or it could just be a function, just as you provided. What do you think?
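
As a rough sketch of the accessor option, pandas lets you register a namespace on every DataFrame with `pd.api.extensions.register_dataframe_accessor`. The accessor name `validation` and its `unique` method are made up for this example and are not part of pyjanitor.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("validation")
class ValidationAccessor:
    """Hypothetical accessor exposing checks as df.validation.<check>(...)."""

    def __init__(self, pandas_obj: pd.DataFrame):
        self._df = pandas_obj

    def unique(self, *columns: str) -> pd.DataFrame:
        """Raise ValueError if the given columns, taken together, repeat."""
        if self._df.duplicated(subset=list(columns)).any():
            raise ValueError(f"Rows are not unique over {list(columns)}")
        # Hand back the original DataFrame so the call can sit in a chain.
        return self._df


logins = pd.DataFrame({"login": ["alee", "blee"], "team": ["x", "x"]})
same_frame = logins.validation.unique("login")
```

The plain-function route would drop the class and be called through `df.pipe(...)` instead; the accessor mainly buys a grouped namespace.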

Btw, I'm so happy you brought up PandasSchema; the code shown in their examples is really awesome!

ericmjl avatar May 08 '19 21:05 ericmjl