
Profiles data, validates schemas, runs data quality checks, and produces a report.

Bigdata profiler

This is a tool to profile your incoming data, check whether it adheres to a registered schema, and run custom data quality checks. At the end, a human-readable report is autogenerated that can be sent to stakeholders.

Features

  • Config-driven data profiling and schema validation
  • Autogeneration of a report after every run
  • Integration with the Datadog monitoring system
  • Extensible and highly customizable
  • Very little boilerplate code
  • Support for versioned schema validation

Data formats currently supported

  • CSV
  • JSON
  • Parquet

Support can easily be extended to any format that Apache Spark can read.
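The loader presumably keys off the dataFormat config value; the sketch below illustrates how a new format could be wired in with Spark's generic reader. The variable names are illustrative and not the tool's actual code, and the sample values come from the example config further down.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-sketch").getOrCreate()

# "csv", "json", "parquet", or any other format Spark can read
# (e.g. "orc", or "avro" with the spark-avro package on the classpath).
data_format = "parquet"                       # hypothetical: maps to dataFormat
input_data_location = "s3a://bucket/prefix/"  # hypothetical: maps to inputDataLocation

df = spark.read.format(data_format).load(input_data_location)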

SQL support for custom data quality checks

Supports both ANSI SQL and HiveQL. A list of all supported SQL functions can be found here.
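Conceptually, each customQN query runs against the data and its single-value result is compared to customQNResultThreshold using customQNOperator. The sketch below is a hypothetical PySpark illustration of that contract, not the tool's actual implementation; it assumes the input is exposed as a temp view named dataset, which is what the sample queries in this README select from.

import operator

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("custom-check-sketch").getOrCreate()

# Expose the input under the view name the custom queries expect.
spark.read.json("s3a://bucket/prefix/generated.json").createOrReplaceTempView("dataset")

# customQ1 from the example config: no duplicate _id values allowed.
query = "select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset"
threshold = 0          # customQ1ResultThreshold
compare = operator.eq  # customQ1Operator "="

result = spark.sql(query).collect()[0][0]
print("PASS" if compare(result, threshold) else "FAIL")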

Contents

  • Data validator notebook
  • Sample dataset
  • Sample dataset schema
  • Sample result report
  • Runner script

Run Instructions

All one has to do is execute the Python script papermill_notebook_runner.py. The script takes the following arguments, in order:

  • Path to the notebook to be run.
  • Path to the output notebook.
  • JSON configuration that will drive the notebook.
python papermill_notebook_runner.py data-validator.ipynb output/data-validator.ipynb '{
  "dataFormat": "json",
  "inputDataLocation": "s3a://bucket/prefix/generated.json",
  "appName": "cust-profile-data-validation",
  "schemaRepoUrl": "http://schemarepohostaddress",
  "scheRepoSubjectName": "cust-profile",
  "schemaVersionId": "0",
  "customQ1": "select CAST(count(_id) - count(distinct _id) as Long) as diff from dataset",
  "customQ1ResultThreshold": 0,
  "customQ1Operator": "=",
  "customQ2": "select CAST(length(phone) as Long) from dataset",
  "customQ2ResultThreshold": 17,
  "customQ2Operator": "=",
  "customQ3": "select CAST(count(distinct gender) as Long) from dataset",
  "customQ3ResultThreshold": 3,
  "customQ3Operator": "<="
}'
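For reference, the runner is presumably a thin wrapper around papermill's execute_notebook; the sketch below shows the idea, though the real script in the repo may differ.

import json
import sys

import papermill as pm

# argv: input notebook, output notebook, JSON string of notebook parameters.
input_nb, output_nb, raw_params = sys.argv[1], sys.argv[2], sys.argv[3]

# papermill injects the parsed values into the notebook's "parameters" cell.
pm.execute_notebook(input_nb, output_nb, parameters=json.loads(raw_params))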

Install Instructions

There are several pieces involved.

  • First, install Jupyter notebooks. Install instructions here.
  • Next, install sparkmagic. Install instructions here.
  • Configure sparkmagic with your own Apache Livy endpoints. The config file should look like this; a trimmed example appears after this list.
  • Install papermill from source after adding the sparkmagic kernels. Clone the papermill project from here.
  • Update the translators file to add the sparkmagic kernels at the very end of the file:
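# Appended to papermill's translators file: register each sparkmagic kernel
# against the translator papermill already defines for that language.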
papermill_translators.register("sparkkernel", ScalaTranslator)
papermill_translators.register("pysparkkernel", PythonTranslator)
papermill_translators.register("sparkrkernel", RTranslator)
  • Finally, install schema repo. Install instructions here.
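For the Livy configuration step above, here is a trimmed example of ~/.sparkmagic/config.json, based on sparkmagic's example_config.json. Replace the URLs with your own Livy endpoints; the full file supports more options.

{
  "kernel_python_credentials": {
    "username": "",
    "password": "",
    "url": "http://your-livy-host:8998",
    "auth": "None"
  },
  "kernel_scala_credentials": {
    "username": "",
    "password": "",
    "url": "http://your-livy-host:8998",
    "auth": "None"
  }
}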

More details

More details can be found in this guide.

That should be it. Enjoy profiling!