Deequ not working within Databricks
Issue: I am facing a similar error on Databricks, using either Python or Scala.
**Error using Python**: TypeError: 'JavaPackage' object is not callable
**Error using Scala**:
command-2987343:5: error: object deequ is not a member of package com.amazon
import com.amazon.deequ.{VerificationSuite, VerificationResult}
command-2987343:8: error: object deequ is not a member of package com.amazon
Versions: Python 3.7.9, PySpark 2.4.0, Scala 2.13.4
Tried: I downloaded the suggested JAR (deequ-1.0.5.jar), uploaded it to the Databricks FileStore, and passed its path to the Spark session.
Script for Python:
import pydeequ
import sagemaker_pyspark
from pyspark.sql import SparkSession, Row
classpath = ":".join(sagemaker_pyspark.classpath_jars()) # aws-specific jars
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", '/FileStore/jars/deequ_1_0_5.jar')
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())
Script for Scala:
%scala
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.concat
import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.{Analysis, ApproxCountDistinct, Completeness, Compliance, Distinctness, InMemoryStateProvider, Size}
val data_path = "/tmp/StreamingDataQuality/source/"
val checkpoint_path = "/tmp/StreamingDataQuality/checkpoint/"
val base_df = spark.read.parquet(data_path)
val empty_df = base_df.where("0 = 1")
val l1: Long = 0
spark.sql("DROP TABLE IF EXISTS trades_delta")
spark.sql("DROP TABLE IF EXISTS bad_records")
spark.sql("DROP TABLE IF EXISTS deequ_metrics")
base_df.createOrReplaceTempView("trades_historical")
empty_df.write.format("delta").saveAsTable("trades_delta")
empty_df.withColumn("batchID",lit(l1)).write.format("delta").saveAsTable("bad_records")
dbutils.fs.mkdirs(checkpoint_path)
Attached Screenshot of PyDeequ Error
Scala error
Could anyone please suggest the appropriate versions, steps, and scripts for a Databricks implementation?
On DBR 9.1 LTS, simply installing the latest Maven package ([email protected]) from the Library settings page works for me.
For DBR versions above 9.1 LTS, we will need to wait until issue #380 is resolved.
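If it is unclear whether the library actually reached the cluster, one way to check from a Python notebook is to look at what py4j returns for the Deequ class. When a name cannot be found on the JVM classpath, py4j resolves it to a JavaPackage placeholder, and calling that placeholder is what raises "'JavaPackage' object is not callable". A minimal sketch, assuming `spark` is the notebook's existing session; `deequ_on_classpath` is a hypothetical helper name (not part of PyDeequ), and `spark._jvm` is a private PySpark attribute:

```python
def deequ_on_classpath(jvm):
    """Return True if com.amazon.deequ.VerificationSuite resolves to a JVM class.

    `jvm` is the py4j JVM view (in PySpark: spark._jvm, a private attribute).
    py4j returns a JavaClass for names it can load and a JavaPackage
    placeholder for names it cannot find on the classpath.
    """
    candidate = jvm.com.amazon.deequ.VerificationSuite
    # Compare by type name so this helper needs no py4j import of its own.
    return type(candidate).__name__ == "JavaClass"


# Hypothetical usage in a Databricks notebook, where `spark` is predefined:
# print(deequ_on_classpath(spark._jvm))
```

If this returns False, the jar is not attached to the running cluster, and changing the session config from inside the notebook is unlikely to help because the session already exists.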
Go to the cluster's environment variables in Databricks and add this line: SPARK_VERSION=3.0.1
Also, make sure you have installed this particular jar on the cluster: deequ-1.1.0_spark-3.0-scala-2.12.jar
Below is the link to download: https://mvnrepository.com/artifact/com.amazon.deequ/deequ/1.1.0_spark-3.0-scala-2.12
As a final step, make sure you are using a cluster with this runtime: 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)
It should work!
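The steps above come down to keeping the jar, Spark, and Scala versions aligned. As a rough sanity check, the sketch below parses the versions out of a Deequ jar name and compares them with the cluster's. It assumes the jar follows the deequ-<version>_spark-<x.y>-scala-<x.y>.jar naming of the jar linked above; `parse_deequ_jar` and `is_compatible` are hypothetical helper names, not part of Deequ or PyDeequ:

```python
import re

# Assumed naming pattern, taken from the jar mentioned above:
# deequ-1.1.0_spark-3.0-scala-2.12.jar
_JAR_PATTERN = re.compile(
    r"deequ-(\d+(?:\.\d+)*)_spark-(\d+\.\d+)-scala-(\d+\.\d+)\.jar"
)


def parse_deequ_jar(jar_name):
    """Return (deequ, spark, scala) versions from the jar name, or None."""
    match = _JAR_PATTERN.fullmatch(jar_name)
    return match.groups() if match else None


def _major_minor(version):
    """Reduce '3.0.1' to '3.0' so patch releases compare equal."""
    return ".".join(version.split(".")[:2])


def is_compatible(jar_name, spark_version, scala_version):
    """True if the jar targets the cluster's Spark and Scala major.minor."""
    parsed = parse_deequ_jar(jar_name)
    if parsed is None:
        return False  # unknown naming scheme, cannot tell
    _, jar_spark, jar_scala = parsed
    return (jar_spark == _major_minor(spark_version)
            and jar_scala == _major_minor(scala_version))


jar = "deequ-1.1.0_spark-3.0-scala-2.12.jar"
print(is_compatible(jar, "3.0.1", "2.12"))    # 7.3 LTS cluster from this answer
print(is_compatible(jar, "2.4.0", "2.13.4"))  # versions listed in the question
```

The first call prints True and the second prints False, which is consistent with the mismatch between the versions listed in the question and what this jar targets.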
Thanks