Soda Spark
Data testing, monitoring, and profiling for Spark Dataframes.
Soda Spark is an extension of Soda SQL that allows you to run Soda SQL functionality programmatically on a Spark data frame.
Soda SQL is an open-source command-line tool. It utilizes user-defined input to prepare SQL queries that run tests on tables in a data warehouse to find invalid, missing, or unexpected data. When tests fail, they surface "bad" data that you can fix to ensure that downstream analysts are using "good" data to make decisions.
Requirements
Soda Spark has the same requirements as soda-sql-spark.
Install
From your shell, execute the following command.
$ pip install soda-spark
Use
From your Python prompt, execute the following commands.
>>> from pyspark.sql import DataFrame, SparkSession
>>> from sodaspark import scan
>>>
>>> spark_session = SparkSession.builder.getOrCreate()
>>>
>>> id = "a76824f0-50c0-11eb-8be8-88e9fe6293fd"
>>> df = spark_session.createDataFrame([
...     {"id": id, "name": "Paula Landry", "size": 3006},
...     {"id": id, "name": "Kevin Crawford", "size": 7243}
... ])
>>>
>>> scan_definition = ("""
... table_name: demodata
... metrics:
... - row_count
... - max
... - min_length
... tests:
... - row_count > 0
... columns:
...   id:
...     valid_format: uuid
...     tests:
...     - invalid_percentage == 0
... sql_metrics:
... - sql: |
...     SELECT sum(size) as total_size_us
...     FROM demodata
...     WHERE country = 'US'
...   tests:
...   - total_size_us > 5000
... """)
>>> scan_result = scan.execute(scan_definition, df)
>>>
>>> scan_result.measurements # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>> scan_result.test_results # doctest: +ELLIPSIS
[TestResult(test=Test(..., expression='row_count > 0', ...), passed=True, skipped=False, ...)]
>>>
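The test results can drive pass/fail logic in a pipeline. For example, a minimal sketch that raises when any test failed, using only the passed and skipped attributes shown above:
>>> failed_tests = [
...     test_result
...     for test_result in scan_result.test_results
...     if not test_result.passed and not test_result.skipped
... ]
>>> if failed_tests:
...     raise AssertionError(f"{len(failed_tests)} test(s) failed")
>>>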
Or, use a scan YAML file.
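For instance, static/demodata.yml could mirror the inline definition above (a plausible sketch; the actual file in the repository may differ):
table_name: demodata
metrics:
- row_count
- max
- min_length
tests:
- row_count > 0
columns:
  id:
    valid_format: uuid
    tests:
    - invalid_percentage == 0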
>>> scan_yml = "static/demodata.yml"
>>> scan_result = scan.execute(scan_yml, df)
>>>
>>> scan_result.measurements # doctest: +ELLIPSIS
[Measurement(metric='schema', ...), Measurement(metric='row_count', ...), ...]
>>>
See the scan result object for all attributes and methods.
Or, return Spark data frames:
>>> measurements, test_results, errors = scan.execute(scan_yml, df, as_frames=True)
>>>
>>> measurements # doctest: +ELLIPSIS
DataFrame[metric: string, column_name: string, value: string, ...]
>>> test_results # doctest: +ELLIPSIS
DataFrame[test: struct<...>, passed: boolean, skipped: boolean, values: map<string,string>, ...]
>>>
See the _to_data_frame functions in scan.py for how the conversion is done.
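Because these are ordinary Spark data frames, the usual DataFrame API applies. For example, a sketch that inspects and persists the measurements (the table name data_quality.measurements is illustrative, not part of soda-spark):
>>> measurements.select("metric", "column_name", "value").show()  # doctest: +SKIP
>>> measurements.write.mode("append").saveAsTable("data_quality.measurements")  # doctest: +SKIP
>>>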
Send results to Soda Cloud
Send the scan result to Soda Cloud.
>>> import os
>>> from sodasql.soda_server_client.soda_server_client import SodaServerClient
>>>
>>> soda_server_client = SodaServerClient(
...     host="cloud.soda.io",
...     api_key_id=os.getenv("API_PUBLIC"),
...     api_key_secret=os.getenv("API_PRIVATE"),
... )
>>> scan_result = scan.execute(scan_yml, df, soda_server_client=soda_server_client)
>>>
Understand
Under the hood, soda-spark does the following (a sketch of the temporary-view step follows the list).
- Set up the scan
- Use the Spark dialect
- Use the Spark session as the warehouse connection
- Create (or replace) a global temporary view for the Spark data frame
- Execute the scan on the temporary view
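The temporary-view step, for example, corresponds roughly to the following sketch (the exact view name soda-spark registers is an implementation detail):
>>> df.createOrReplaceGlobalTempView("demodata")
>>> spark_session.sql(
...     "SELECT COUNT(*) AS row_count FROM global_temp.demodata"
... ).collect()
[Row(row_count=2)]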