deequ Timestamp support

Timestamp support

Open stefan-grafberger opened this issue 5 years ago • 8 comments

Could someone a little bit more elaborated what should be done in this task?

Feb 17 '20 09:02 klangner

At the moment, deequ does not support any metrics calculations on timestamp/date columns. The task here would be to integrate those. A problem ist that most of our analyzers produce a DoubleMetric (e.g. for the max), but here we would need to implement a new Metric that operates on dates.

Feb 17 '20 15:02 sscdotopen

Is it ok to change Maximum and Minimum so it will support any type with total order instead of only Double? I can give it a try then.

Feb 17 '20 16:02 klangner

Lets try to make it support timestamps in addition to what it supports now. In general, we only operate on Spark's supported column types.

Feb 17 '20 16:02 sscdotopen

I have implemented whole new analyzers for timestamp with metrics that supports timestamp values. and added constraints that supports Spark's DateType and TimestampType to cover many use cases. I like to submit a PR if it is required?

Sep 23 '20 09:09 Yash0215

We would be very happy to receive such a PR!

Sep 23 '20 09:09 sscdotopen

Cannot wait for these features to be added!

Sep 30 '20 15:09 apython1998

For now you can compute the min/max for any datatype by aggregating the data yourself and using another metric. The HistogramMetric. Example below:

import org.apache.spark.sql.functions.{min, max}
import com.amazon.deequ.analyzers.{Histogram}
// init the data
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long
)


val rdd = spark.sparkContext.parallelize(Seq(
  Item(1, "Thingy A", "awesome thing.", "high", 0),
  Item(2, "Thingy B", "available at http://thingb.com", null, 0),
  Item(3, null, null, "low", 5),
  Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
  Item(5, "Thingy E", null, "high", 12)))

val data = spark.createDataFrame(rdd)

// compute the min/max directly and we filter nulls
val dataMinMax = (

data.filter($"productName".isNotNull).agg(min($"productName") as "minProductName", max($"productName") as "maxProductName")
)

// we now use a histogram

{ AnalysisRunner
  // data to run the analysis on
  .onData(dataMinMax)                       
  // define analyzers that compute metrics
  .addAnalyzer(Histogram("minProductName"))
  .addAnalyzer(Histogram("maxProductName"))
  .run()
  .allMetrics.foreach(println(_))
}

Ideally Deequ should allow users to add Metrics and Analyzers. The framework should be open for extensions.

Feb 28 '23 16:02 MassyB

deequ deequ copied to clipboard

Timestamp support

deequ
deequ copied to clipboard