Supporting pyspark for data processing
Setting a data model is one thing; ensuring the execution of production data pipelines is another. With Spark now the de facto standard for enterprise data processing, this module programmatically interprets FIRE entities into Spark execution pipelines, resulting in the transmission and processing of high-quality data in batch or real time.
- data schematization: programmatically reads FIRE entities, supertypes and references and creates their Spark schema equivalents. Missing fields are still supported, and data types are enforced according to the specification (e.g. a date is processed as a date, not as a string); see the sketch below
- data quality: FIRE specifications are programmatically translated into Spark SQL constraints (e.g. nullable fields, cardinality, minimum, maximum, enums)
Benefits: this removes the need for a data / ops team to re-code the FIRE models into pipelines; they are inferred directly from the JSON files.
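To give a rough idea of what this looks like, here is a minimal sketch (not the module's actual API): it reads a FIRE entity expressed as plain JSON Schema, derives a Spark StructType, and turns a few schema keywords into Spark SQL constraints. The file paths, field names and constraint rules below are illustrative assumptions only.

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DoubleType, DateType
)


def to_spark_type(prop):
    """Map a JSON Schema property to a Spark type (assumed, simplified mapping)."""
    if prop.get("format") == "date":
        return DateType()
    return {
        "string": StringType(),
        "integer": IntegerType(),
        "number": DoubleType(),
    }.get(prop.get("type"), StringType())


def to_spark_schema(fire_entity):
    """Build a Spark schema from a FIRE entity's JSON Schema properties."""
    required = set(fire_entity.get("required", []))
    return StructType([
        StructField(name, to_spark_type(prop), nullable=name not in required)
        for name, prop in fire_entity["properties"].items()
    ])


def to_sql_constraints(fire_entity):
    """Translate simple schema keywords into Spark SQL filter expressions."""
    constraints = []
    for name, prop in fire_entity["properties"].items():
        if "enum" in prop:
            values = ", ".join("'%s'" % v for v in prop["enum"])
            constraints.append("%s IN (%s)" % (name, values))
        if "minimum" in prop:
            constraints.append("%s >= %s" % (name, prop["minimum"]))
        if "maximum" in prop:
            constraints.append("%s <= %s" % (name, prop["maximum"]))
    return constraints


if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    with open("schemas/loan.json") as f:   # schema path is an assumption
        entity = json.load(f)
    # enforce the derived schema at read time, then filter on the constraints
    df = spark.read.json("data/loans/*.json", schema=to_spark_schema(entity))
    valid = df
    for constraint in to_sql_constraints(entity):
        valid = valid.filter(constraint)
```

The same derived constraints can also be inverted to quarantine the records that fail validation rather than silently dropping them.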
Team, I've got the code supporting both Python 2 and 3. Happy to provide more context around that PR if needed.
Hi @aamend - thanks for the high-quality pull request (including tests, which is nice to see). We totally welcome this kind of interaction with the Fire schemas, however I think this repo is not the right place for it. This repo is aimed at pure, language-agnostic JSON schemas (the Python here is only scripts for testing purposes rather than components of Fire itself). While it's true that Spark is immensely popular for data processing, the schemas are designed to be language-agnostic, accessible and useful to a wide variety of institutions with varying tech stacks and capabilities.
On the other hand, I think it would make a good stand-alone Python module (have you considered also adding it to PyPI?). If you wanted to make a separate repo for this project, we would definitely be up for linking to it (and similar integrations) in the main README, and I could probably offer some help, e.g. on linking to the schemas from your project.
Thanks for the review. I completely get your point and agree in principle. In practice, we need to link both projects, either loosely (as you suggested) or tightly (as proposed in this change). I was thinking this would be the easiest way as it creates a simple bundle to pip install. If not, we'll need to ensure the FIRE data is available as a dependency to, e.g., pyspark code.
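For illustration, one possible way to handle that in a standalone repo would be to ship the schema JSON files as package data, so downstream pyspark code can load them without cloning this repo. This is only a sketch; the package name and directory layout below are placeholders, not an agreed approach.

```python
from setuptools import setup, find_packages

setup(
    name="fire-pyspark",                 # placeholder name
    version="0.1.0",
    packages=find_packages(),
    # bundle the FIRE schema JSON files inside the installed package
    package_data={"fire_pyspark": ["schemas/*.json"]},
    include_package_data=True,
    install_requires=["pyspark"],
)
```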
Realising I've kind of dropped the ball here, but keen to move forward. I can always publish my own repo but would need to discuss possible integrations first. Any chance you could reach out to [email protected]?