
Profiles data, validates schemas, runs data quality checks, and produces a report.

Bigdata profiler

This is a tool to profile your incoming data, check whether it adheres to a registered schema, and run custom data quality checks. At the end, a human-readable report is autogenerated that can be sent to stakeholders.

Features

  • Config-driven data profiling and schema validation
  • Autogeneration of a report after every run
  • Integration with the Datadog monitoring system
  • Extensible and highly customizable
  • Very little boilerplate code
  • Support for versioned schema validation

Data formats currently supported

  • CSV
  • JSON
  • Parquet

Support can easily be extended to any format that Apache Spark can read.
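The loader presumably keys off the dataFormat config value; the sketch below illustrates how a new format could be wired in with Spark's generic reader. The variable names are illustrative and not the tool's actual code, and the sample values come from the example config further down.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-sketch").getOrCreate()

# "csv", "json", "parquet", or any other format Spark can read
# (e.g. "orc", or "avro" with the spark-avro package on the classpath).
data_format = "parquet"                       # hypothetical: maps to dataFormat
input_data_location = "s3a://bucket/prefix/"  # hypothetical: maps to inputDataLocation

df = spark.read.format(data_format).load(input_data_location)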

SQL support for custom data quality checks

Supports both ANSI SQL and HiveQL. A list of all supported SQL functions can be found here.
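Conceptually, each customQN query runs against the data and its single-value result is compared to customQNResultThreshold using customQNOperator. The sketch below is a hypothetical PySpark illustration of that contract, not the tool's actual implementation; it assumes the input is exposed as a temp view named dataset, which is what the sample queries in this README select from.

import operator

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("custom-check-sketch").getOrCreate()

# Expose the input under the view name the custom queries expect.
spark.read.json("s3a://bucket/prefix/generated.json").createOrReplaceTempView("dataset")

# customQ1 from the example config: no duplicate _id values allowed.
query = "select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset"
threshold = 0          # customQ1ResultThreshold
compare = operator.eq  # customQ1Operator "="

result = spark.sql(query).collect()[0][0]
print("PASS" if compare(result, threshold) else "FAIL")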

Contents

  • Data validator notebook
  • Sample dataset
  • Sample dataset schema
  • Sample result report
  • Runner script

Run Instructions

All one has to do is execute the Python script papermill_notebook_runner.py. The script takes the following arguments, in order:

  • Path to the notebook to be run.
  • Path to the output notebook.
  • JSON configuration that will drive the notebook.
python papermill_notebook_runner.py data-validator.ipynb output/data-validator.ipynb '{
  "dataFormat": "json",
  "inputDataLocation": "s3a://bucket/prefix/generated.json",
  "appName": "cust-profile-data-validation",
  "schemaRepoUrl": "http://schemarepohostaddress",
  "scheRepoSubjectName": "cust-profile",
  "schemaVersionId": "0",
  "customQ1": "select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset",
  "customQ1ResultThreshold": 0,
  "customQ1Operator": "=",
  "customQ2": "select CAST(length(phone) as Long) from dataset",
  "customQ2ResultThreshold": 17,
  "customQ2Operator": "=",
  "customQ3": "select CAST(count(distinct gender) as Long) from dataset",
  "customQ3ResultThreshold": 3,
  "customQ3Operator": "<="
}'
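For reference, the runner is presumably a thin wrapper around papermill's execute_notebook; the sketch below shows the idea, though the real script in the repo may differ.

import json
import sys

import papermill as pm

# argv: input notebook, output notebook, JSON string of notebook parameters.
input_nb, output_nb, raw_params = sys.argv[1], sys.argv[2], sys.argv[3]

# papermill injects the parsed values into the notebook's "parameters" cell.
pm.execute_notebook(input_nb, output_nb, parameters=json.loads(raw_params))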

Install Instructions

There are several pieces involved.

  • First, install Jupyter notebooks. Install instructions here.
  • Next, install sparkmagic. Install instructions here.
  • Configure sparkmagic with your own Apache Livy endpoints. The config file should look like this; a trimmed example appears after this list.
  • Install papermill from source after adding the sparkmagic kernels. Clone the papermill project from here.
  • Update the translators file to add the sparkmagic kernels at the very end of the file:
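# Appended to papermill's translators file: register each sparkmagic kernel
# against the translator papermill already defines for that language.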
papermill_translators.register("sparkkernel", ScalaTranslator)
papermill_translators.register("pysparkkernel", PythonTranslator)
papermill_translators.register("sparkrkernel", RTranslator)
  • Finally, install schema repo. Install instructions here.
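For the Livy configuration step above, here is a trimmed example of ~/.sparkmagic/config.json, based on sparkmagic's example_config.json. Replace the URLs with your own Livy endpoints; the full file supports more options.

{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://your-livy-host:8998",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://your-livy-host:8998",
    "auth": "None"
  }
}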

More details

More details can be found in this guide.

That should be it. Enjoy profiling!