deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
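To make the idea of "unit tests for data" concrete, here is a minimal plain-Python sketch of the concept. It is not Deequ's API and does not use Spark. The names `completeness`, `distinct_ratio`, `Constraint`, and `run_checks` are hypothetical, chosen only for illustration, and the sample rows are made up. Each constraint pairs a metric computed over the dataset with an assertion on that metric.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def completeness(column: str) -> Callable[[Rows], float]:
    """Build a metric: fraction of rows whose value in `column` is not None."""
    def metric(rows: Rows) -> float:
        if not rows:
            return 1.0  # assumption: an empty dataset counts as complete
        non_null = sum(1 for row in rows if row.get(column) is not None)
        return non_null / len(rows)
    return metric


def distinct_ratio(column: str) -> Callable[[Rows], float]:
    """Build a metric: number of distinct values in `column` divided by row count."""
    def metric(rows: Rows) -> float:
        if not rows:
            return 1.0  # assumption: an empty dataset has no duplicates
        return len({row.get(column) for row in rows}) / len(rows)
    return metric


@dataclass
class Constraint:
    """Hypothetical 'unit test for data': a metric plus an assertion on its value."""
    description: str
    metric: Callable[[Rows], float]
    assertion: Callable[[float], bool]


def run_checks(rows: Rows, constraints: List[Constraint]) -> Dict[str, bool]:
    """Evaluate every constraint and report pass/fail by description."""
    return {c.description: c.assertion(c.metric(rows)) for c in constraints}


# Made-up sample data: one missing name and one duplicated id.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 3, "name": "c"},
    {"id": 3, "name": "d"},
]

results = run_checks(rows, [
    Constraint("id is complete", completeness("id"), lambda v: v == 1.0),
    Constraint("name is at least 90% complete", completeness("name"), lambda v: v >= 0.9),
    Constraint("id has no duplicates", distinct_ratio("id"), lambda v: v == 1.0),
])

for description, passed in results.items():
    print(f"{description}: {'passed' if passed else 'FAILED'}")
```

The sketch runs in memory on a list of dictionaries. A library built on Spark would instead compute such metrics over a distributed DataFrame; only the shape of the idea (metric plus assertion) is illustrated here.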
Is there a simple quick start example for the usage of deequ in Spark SQL? If anyone has some suggestions or examples, that would be greatly helpful. Alternatively, it could...
I am trying to update my Databricks runtime to the newest version (DBR 11.0). However, the deequ package is not being installed properly. On the older Databricks runtimes the package...
### Mean is calculated incorrectly when the value for the column is really high (Example: EpochTimestamp) and the size of the dataset is high as well (Dataset Size). **Based on...
*Issue #, if available:* - Currently .hasPattern always fails for null values *Description of changes:* - I checked #342, so I added an isNullAllowed variable, and if isNullAllowed is true,...
*Issue #, if available:* https://github.com/awslabs/deequ/issues/380 *Description of changes:* Support for Scala 2.13 and Spark 3.2. It is not fully done, as I faced two issues before I could even successfully...
When using the ColumnProfilerRunner function, how do I solve it? I also tried to work with this on my Mac and got the same error. Glue version 2.0, Spark 2.4, Python...
New Features: 1. A Date Time Distribution analyzer for analyzing the distribution of the records based on 'DateType' or 'TimestampType' feature within fixed time intervals. files changed/created: DateTimeDistribution.scala DateTimeAggregation.scala DeequFunctions.scala...
I have around 50 columns in my table. When I try to run the Entropy analyzer, it creates multiple jobs and they do not execute in parallel, while completeness is...
Spark Version: 3.2.1 Scala Version: 2.13.8 This is what the error looks like: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$ at deequ5$.main(Main.scala:11) at deequ5.main(Main.scala) Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$ And...
*Description of changes:* This PR lifts the private restrictions on the SERDE classes. It also reorganizes the code to a single object/class per file for easier code navigation and discovery....