ydata-profiling
Open to contributions and a collaboration with pandera?
Hi! First of all I'm a big fan of the library 🎉, I've been using it myself from its early days to now.
Missing functionality
pandera is a data validation library that makes it easy to define dataframe types and do run-time validation via type-annotations. It currently does a little bit of data profiling in order to support the schema inference feature, but I think it makes a lot of sense to leverage pandas-profiling's more advanced capabilities on this front (e.g. the statistical summaries, which in theory could be converted into hypothesis tests in pandera).
I was wondering if the pandas-profiling maintainers would be open to a contribution for functionality similar to the great expectations integration?
Proposed feature
The user API would be straightforward:
from pandas_profiling import ProfileReport
df = ...
profile = ProfileReport(df)
pandera_schema = profile.to_pandera_schema()
# validate the data itself
pandera_schema(df) # should pass
# validate new data
new_df = ...
pandera_schema(new_df) # may fail
Under the hood, pandera would use the profile.get_description() summary or profile.to_json() to construct a pandera schema, which users could then use directly in their script/notebook, or serialize with schema.to_yaml() or schema.to_script() if they want to reuse the schema in some other process.
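To make the idea concrete, here is a minimal sketch of turning summary statistics into validation rules. It uses only the standard library. The `variables` layout with `min`, `max`, and `n_missing` keys is an assumed stand-in for the serialized profile, and `column_rules` and `check_value` are hypothetical names, not part of either library.

```python
import json

# Hypothetical, hand-written stand-in for a serialized profile.
# The real output of profile.to_json() may be laid out differently.
PROFILE_JSON = json.dumps({
    "variables": {
        "age": {"type": "Numeric", "min": 18, "max": 90, "n_missing": 0},
        "city": {"type": "Categorical", "n_missing": 3},
    }
})


def column_rules(profile_json):
    """Derive simple per-column validation rules from the summary."""
    variables = json.loads(profile_json)["variables"]
    rules = {}
    for name, stats in variables.items():
        # A column is treated as nullable only if the profiled data had gaps.
        rule = {"nullable": stats.get("n_missing", 0) > 0}
        # Observed min/max become a range constraint when both are present.
        if "min" in stats and "max" in stats:
            rule["range"] = (stats["min"], stats["max"])
        rules[name] = rule
    return rules


def check_value(rule, value):
    """Return True if a single value satisfies the derived rule."""
    if value is None:
        return rule["nullable"]
    if "range" in rule:
        low, high = rule["range"]
        return low <= value <= high
    return True


rules = column_rules(PROFILE_JSON)
```

In an actual integration, the range rule would roughly correspond to a pandera `Check.in_range` and the nullable flag to `Column(nullable=...)`. Note that this makes the schema as strict as the profiled data: new data outside the observed range fails, which matches the "may fail" case above.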
I think it makes sense to implement the parsing/reconstruction logic in pandas-profiling for two reasons. First, I'd like to adopt the pattern of converting the visions typeset into the pandera type system (whose types are basically just aliases of numpy/pandas machine types). Second, looking at the great expectations integration, it seems like pandas-profiling already has a nice set of abstractions for handling the complexity of converting profiles to a data validation format.
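The type conversion could start as a plain lookup table. The sketch below is an assumption throughout: the type names on the left and the dtype strings on the right are illustrative choices, and `to_dtype` is a hypothetical helper rather than an existing function in either library.

```python
# Hypothetical mapping from profiler type names to pandas dtype strings.
# Both sides are illustrative; the real typeset is richer than this.
TYPE_TO_DTYPE = {
    "Numeric": "float64",
    "Categorical": "object",  # assumes categoricals are stored as strings
    "Boolean": "bool",
    "DateTime": "datetime64[ns]",
}


def to_dtype(profile_type, default="object"):
    """Look up the dtype for a profiled type, falling back to a default."""
    return TYPE_TO_DTYPE.get(profile_type, default)
```

Falling back to `object` keeps unknown types permissive instead of raising, which seems like the safer default for an inferred schema.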
On the pandera side, I'd want to add schema.from_profile to be able to read an in-memory or serialized profile (in json for example).
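Accepting both forms could be as simple as normalizing the input first. This is only a sketch: `load_profile` is a hypothetical helper, and it assumes the in-memory profile is available as a plain dict.

```python
import json


def load_profile(profile):
    """Return the profile as a dict, given either a dict or a JSON string."""
    if isinstance(profile, dict):
        return profile
    if isinstance(profile, (str, bytes)):
        return json.loads(profile)
    raise TypeError(f"unsupported profile type: {type(profile).__name__}")
```

A `from_profile` method could call a helper like this once and then build the schema from the resulting dict, so the conversion logic only has to handle one input shape.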
Alternatives considered
Implement the profile -> schema logic in pandera. This is possible, but as I mentioned above, I think using the type abstractions in pandas-profiling would make for a smoother integration.
Hi @cosmicBboy, sounds interesting. Let's have a chat on the Slack channel.
Hi, is there any update on it?
@cosmicBboy Is this still a relevant development? I would be thrilled to discuss ideas.