
Timestamp support

Open stefan-grafberger opened this issue 5 years ago • 8 comments

stefan-grafberger avatar Sep 27 '18 13:09 stefan-grafberger

Could someone elaborate a little more on what should be done in this task?

klangner avatar Feb 17 '20 09:02 klangner

At the moment, deequ does not support any metric computations on timestamp/date columns. The task here would be to integrate those. A problem is that most of our analyzers produce a DoubleMetric (e.g., for the max), but here we would need to implement a new Metric that operates on dates.
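For illustration, a minimal sketch of what such a metric could look like, assuming deequ's Metric[T] trait and its flatten() hook; a TimestampMetric class itself is hypothetical and does not exist in deequ today:

import java.sql.Timestamp
import scala.util.Try
import com.amazon.deequ.metrics.{DoubleMetric, Entity, Metric}

// Hypothetical sketch: mirrors DoubleMetric's shape but carries a
// Timestamp in its Try instead of a Double.
case class TimestampMetric(
    entity: Entity.Value,
    name: String,
    instance: String,
    value: Try[Timestamp])
  extends Metric[Timestamp] {

  // Fall back to epoch milliseconds so double-based reporting code
  // can still consume this metric.
  override def flatten(): Seq[DoubleMetric] =
    Seq(DoubleMetric(entity, name, instance, value.map(_.getTime.toDouble)))
}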

sscdotopen avatar Feb 17 '20 15:02 sscdotopen

Is it ok to change Maximum and Minimum so they support any type with a total order instead of only Double? I can give it a try then.

klangner avatar Feb 17 '20 16:02 klangner

Let's try to make it support timestamps in addition to what it supports now. In general, we only operate on Spark's supported column types.
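For reference, Spark's own min/max aggregates already handle TimestampType at the SQL level, so the work here is mostly about how deequ surfaces the result (not as a Double). A quick standalone check, assuming a SparkSession named spark:

import java.sql.Timestamp
import org.apache.spark.sql.functions.{max, min}
import spark.implicits._

// Spark computes min/max over TimestampType columns natively.
val events = Seq(
  Timestamp.valueOf("2020-02-17 09:00:00"),
  Timestamp.valueOf("2020-02-17 16:30:00")
).toDF("eventTime")

events.agg(min($"eventTime") as "earliest", max($"eventTime") as "latest").show()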

sscdotopen avatar Feb 17 '20 16:02 sscdotopen

I have implemented entirely new analyzers for timestamps, with metrics that support timestamp values, and added constraints that support Spark's DateType and TimestampType to cover many use cases. I'd like to submit a PR if that is welcome?
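In the meantime, simple range checks on a timestamp column can already be expressed through a SQL predicate with Check.satisfies; dedicated date/timestamp constraints would go beyond this. A sketch, where the DataFrame data and the column eventTime are illustrative assumptions:

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// Workaround sketch: Spark casts the string literals to timestamps
// when comparing them against a TimestampType column.
val result = VerificationSuite()
  .onData(data)
  .addCheck(
    Check(CheckLevel.Error, "timestamp range")
      .satisfies(
        "eventTime BETWEEN '2018-01-01' AND '2021-01-01'",
        "eventTime within expected range"))
  .run()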

Yash0215 avatar Sep 23 '20 09:09 Yash0215

We would be very happy to receive such a PR!

sscdotopen avatar Sep 23 '20 09:09 sscdotopen

Cannot wait for these features to be added!

apython1998 avatar Sep 30 '20 15:09 apython1998

For now you can compute the min/max for any datatype by aggregating the data yourself and then running another metric, the HistogramMetric, over the result. Example below:

import org.apache.spark.sql.functions.{min, max}
import com.amazon.deequ.analyzers.Histogram
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import spark.implicits._ // for the $"..." column syntax
// init the data
case class Item(
  id: Long,
  productName: String,
  description: String,
  priority: String,
  numViews: Long
)


val rdd = spark.sparkContext.parallelize(Seq(
  Item(1, "Thingy A", "awesome thing.", "high", 0),
  Item(2, "Thingy B", "available at http://thingb.com", null, 0),
  Item(3, null, null, "low", 5),
  Item(4, "Thingy D", "checkout https://thingd.ca", "low", 10),
  Item(5, "Thingy E", null, "high", 12)))

val data = spark.createDataFrame(rdd)

// compute the min/max directly, filtering out nulls first
val dataMinMax = data
  .filter($"productName".isNotNull)
  .agg(min($"productName") as "minProductName", max($"productName") as "maxProductName")

// run a Histogram over the one-row aggregate: each column holds a single
// distinct value, so the histogram bucket label is the min/max itself
AnalysisRunner
  // data to run the analysis on
  .onData(dataMinMax)
  // define analyzers that compute metrics
  .addAnalyzer(Histogram("minProductName"))
  .addAnalyzer(Histogram("maxProductName"))
  .run()
  .allMetrics
  .foreach(println)

Ideally, deequ should allow users to add their own Metrics and Analyzers; the framework should be open for extension.
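A hypothetical sketch of what such an extension point could look like; none of these traits exist in deequ, they only illustrate the idea of user-defined, typed analyzers:

import java.sql.Timestamp
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.max

// Hypothetical extension point, not part of deequ's public API:
// a user-supplied analyzer that returns a typed value, not a Double.
trait TypedAnalyzer[T] {
  def compute(data: DataFrame): T
}

// Example user extension: maximum of a TimestampType column.
class TimestampMaximum(column: String) extends TypedAnalyzer[Timestamp] {
  override def compute(data: DataFrame): Timestamp =
    data.agg(max(column)).head.getTimestamp(0)
}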

MassyB avatar Feb 28 '23 16:02 MassyB