
SPIKE: Use Pandera for dataframe validation/coercion?

Open: fubuloubu opened this issue 2 years ago • 1 comment

Elevator pitch:

Pandera is a library for dataframe validation. Its creator, @cosmicBboy, is interested in whether Pandera might solve some of the design issues in our query plugin system.

Value:

Pandera validates data against a schema and can coerce data from different formats into dataframes for users. Our query plugin system currently returns iterators of our pydantic data models (namely the BlockAPI, TransactionAPI and ContractEvent classes) from query plugins. Using Pandera, we may be able to dynamically build a schema from a QueryType object and use it to coerce, shape, and filter data streams from the different query plugins into the form required for downstream use (either as dataframes for the .query methods, or as the data models themselves for the .range methods).

Additionally, Pandera abstracts over the type of dataframe, which means it can support coercion into different dataframe classes for downstream use cases (Pandas, PySpark, Dask, etc.).

Task list:

  • [ ] Can Pandera dynamically create schemas from QueryType objects? (probably from a method on _BaseQuery)
  • [ ] Can Pandera schemas be used to coerce different types of data streams (ProviderAPI methods from DefaultQueryProvider, SQL results from CacheQueryProvider, or a REST API data stream from Kerkopes) more efficiently than the current approach, which requires converting them into Iterator[BaseDataModel] format?
  • [ ] Can Pandera increase the type safety of the QueryAPI plugin system for downstream integrators, and/or reduce the overhead of queries (for example, by letting queries return only the relevant columns, i.e. partial queries)?
  • [ ] How much additional overhead/complexity does Pandera add for the solution(s) it provides to the above?
  • [ ] What is the pedigree of the library and how well does it fit with our current set of dependencies?
  • [ ] How can we let users configure the type of "dataframe-like" libraries they wish to use with their calls to the .query methods?
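For the first task, one prerequisite that does not depend on Pandera is enumerating the field names and types of a data model and restricting them to the columns a query asks for. The sketch below does this with a plain dataclass standing in for a pydantic model such as BlockAPI. The `Block` class, its fields, and the `column_types` helper are all hypothetical, and how `_BaseQuery` would actually expose this remains an open question.

```python
from dataclasses import dataclass
from typing import get_type_hints


@dataclass
class Block:
    """Hypothetical stand-in for a data model such as BlockAPI."""

    number: int
    timestamp: int
    hash: str


def column_types(model, columns):
    """Map each requested column to its annotated type, rejecting unknown columns."""
    hints = get_type_hints(model)
    unknown = [name for name in columns if name not in hints]
    if unknown:
        raise ValueError(f"Unknown columns for {model.__name__}: {unknown}")
    return {name: hints[name] for name in columns}


# Select only the columns a partial query would need.
selected = column_types(Block, ["number", "hash"])
```

A mapping like `selected` could then be passed to a schema constructor. Confirming that Pandera accepts such a mapping cleanly for every model is part of what the spike would need to establish.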

fubuloubu, Aug 16 '22 19:08

hey @fubuloubu just following up here. Thanks for making the issue!

I can help out with some of this task list, but it'll probably be ~1-2 months before I can really get my hands dirty with this. Let me know if there's anything I can do to help with info, links, resources.

cosmicBboy, Aug 23 '22 14:08