deequ
deequ copied to clipboard
Timestamp support
Could someone a little bit more elaborated what should be done in this task?
At the moment, deequ does not support any metrics calculations on timestamp/date columns. The task here would be to integrate those. A problem ist that most of our analyzers produce a DoubleMetric (e.g. for the max), but here we would need to implement a new Metric that operates on dates.
Is it ok to change Maximum and Minimum so it will support any type with total order instead of only Double? I can give it a try then.
Lets try to make it support timestamps in addition to what it supports now. In general, we only operate on Spark's supported column types.
I have implemented whole new analyzers for timestamp with metrics that supports timestamp values. and added constraints that supports Spark's DateType and TimestampType to cover many use cases. I like to submit a PR if it is required?
We would be very happy to receive such a PR!
Cannot wait for these features to be added!
For now you can compute the min/max for any datatype by aggregating the data yourself and using another metric. The HistogramMetric
. Example below:
import org.apache.spark.sql.functions.{min, max}
import com.amazon.deequ.analyzers.{Histogram}
// init the data
case class Item(
id: Long,
productName: String,
description: String,
priority: String,
numViews: Long
)
val rdd = spark.sparkContext.parallelize(Seq(
Item(1, "Thingy A", "awesome thing.", "high", 0),
Item(2, "Thingy B", "available at http://thingb.com", null, 0),
Item(3, null, null, "low", 5),
Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
Item(5, "Thingy E", null, "high", 12)))
val data = spark.createDataFrame(rdd)
// compute the min/max directly and we filter nulls
val dataMinMax = (
data.filter($"productName".isNotNull).agg(min($"productName") as "minProductName", max($"productName") as "maxProductName")
)
// we now use a histogram
{ AnalysisRunner
// data to run the analysis on
.onData(dataMinMax)
// define analyzers that compute metrics
.addAnalyzer(Histogram("minProductName"))
.addAnalyzer(Histogram("maxProductName"))
.run()
.allMetrics.foreach(println(_))
}
Ideally Deequ should allow users to add Metrics
and Analyzers
. The framework should be open for extensions.