SPIKE: Use Pandera for dataframe validation/coercion?
Elevator pitch:
Pandera is a library for dataframe validation. Its creator, @cosmicBboy, is interested in whether Pandera might solve some of the issues in our query plugin system's design.
Value:
Pandera does data schema validation and can coerce data from different formats into dataframes for users. Our query plugin system currently uses iterators of our pydantic data models (namely the `BlockAPI`, `TransactionAPI`, and `ContractEvent` classes) as the return value from query plugins. Using Pandera, we may be able to dynamically build a schema from a `QueryType` object and use it to coerce, shape, and filter data streams from the different query plugins into the shape(s) required downstream (either dataframes for `.query` methods, or the data models themselves for `.range` methods).
Additionally, Pandera abstracts over the type of dataframe, which means it will support coercion into different dataframe classes for downstream use cases (Pandas, PySpark, Dask, etc.).
Task list:
- [ ] Can Pandera dynamically create schemas from `QueryType` objects? (probably from a method on `_BaseQuery`)
- [ ] Can Pandera schemas be used to coerce different types of data streams (`ProviderAPI` methods from `DefaultQueryProvider`, SQL results from `CacheQueryProvider`, or a REST API data stream from Kerkopes) more efficiently than the current requirement to convert them into `Iterator[BaseDataModel]` format?
- [ ] Can Pandera increase the type safety of the `QueryAPI` plugin system for downstream integrators, and/or reduce the overhead of queries (for example, by letting queries return only the relevant columns, i.e. partial queries)?
- [ ] How much additional overhead/complexity does Pandera add for the solution(s) it provides to the above?
- [ ] What is the pedigree of the library and how well does it fit with our current set of dependencies?
- [ ] How can we let users configure which "dataframe-like" library they wish to use with their calls to the `.query` methods?
hey @fubuloubu just following up here. Thanks for making the issue!
I can help out with some of this task list, but it'll probably be ~1-2 months before I can really get my hands dirty with this. Let me know if there's anything I can do to help with info, links, resources.