
Deequ not working within Databricks (

Open gbalachandra-takeda opened this issue 4 years ago • 2 comments

Issue: I am facing a similar error using Databricks (with either Python or Scala).

**Error using Python**: TypeError: 'JavaPackage' object is not callable

**Error using Scala**: command-2987343:5: error: object deequ is not a member of package com.amazon import com.amazon.deequ.{VerificationSuite, VerificationResult}

command-2987343:8: error: object deequ is not a member of package com.amazon

Version: Python 3.7.9, PySpark 2.4.0, Scala 2.13.4

Tried: I downloaded the suggested JAR (deequ-1.0.5.jar), uploaded it to the Databricks FileStore, and passed its path to the Spark session.

Python script:

import pydeequ
import sagemaker_pyspark
from pyspark.sql import SparkSession, Row
classpath = ":".join(sagemaker_pyspark.classpath_jars()) # aws-specific jars
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", '/FileStore/jars/deequ_1_0_5.jar')
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

Scala script:


%scala
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.concat
import com.amazon.deequ.{VerificationSuite, VerificationResult}
import com.amazon.deequ.VerificationResult.checkResultsAsDataFrame
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import com.amazon.deequ.analyzers._
import com.amazon.deequ.analyzers.runners.AnalysisRunner
import com.amazon.deequ.analyzers.runners.AnalyzerContext.successMetricsAsDataFrame
import com.amazon.deequ.analyzers.{Analysis, ApproxCountDistinct, Completeness, Compliance, Distinctness, InMemoryStateProvider, Size}

val data_path = "/tmp/StreamingDataQuality/source/"
val checkpoint_path = "/tmp/StreamingDataQuality/checkpoint/"
val base_df = spark.read.parquet(data_path)
val empty_df = base_df.where("0 = 1")
val l1: Long = 0

spark.sql("DROP TABLE IF EXISTS trades_delta")
spark.sql("DROP TABLE IF EXISTS bad_records")
spark.sql("DROP TABLE IF EXISTS deequ_metrics")

base_df.createOrReplaceTempView("trades_historical")
empty_df.write.format("delta").saveAsTable("trades_delta")
empty_df.withColumn("batchID",lit(l1)).write.format("delta").saveAsTable("bad_records")
dbutils.fs.mkdirs(checkpoint_path)

Attached screenshot of the PyDeequ error: deequ_error

Attached screenshot of the Scala error: scala_deequ_error

Could anyone please suggest the appropriate versions, steps, and scripts for a Databricks implementation?

gbalachandra-takeda, Feb 07 '22 03:02

On DBR 9.1 LTS, simply installing the latest Maven package ([email protected]) from the Library settings page works for me.

For DBR versions above 9.1 LTS, we will need to wait until issue #380 is resolved.

kelvinluime, Feb 10 '22 06:02

Go to the cluster's environment variables in Databricks and add this line: SPARK_VERSION=3.0.1

Also, make sure you have installed this particular jar: deequ-1.1.0_spark-3.0-scala-2.12.jar

Below is the link to download: https://mvnrepository.com/artifact/com.amazon.deequ/deequ/1.1.0_spark-3.0-scala-2.12

As a final step, make sure the cluster you are using runs this runtime: 7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)

It should work!
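The version matching described above can be sketched in a notebook cell. This is only an illustration, not an official PyDeequ recipe: the helper `deequ_maven_coordinate` is hypothetical, and it assumes the installed PyDeequ release reads `SPARK_VERSION` from the process environment, so the variable has to be set before `import pydeequ` runs. The version values are the ones quoted in this reply.

```python
import os

# Values quoted in the reply above; change them to match your own cluster.
SPARK_VERSION = "3.0.1"   # Spark version shipped with DBR 7.3 LTS
DEEQU_VERSION = "1.1.0"
SCALA_BINARY = "2.12"


def deequ_maven_coordinate(deequ_version, spark_version, scala_binary):
    """Hypothetical helper: build the Maven coordinate of the deequ jar
    that matches a given Spark major.minor and Scala binary version."""
    spark_minor = ".".join(spark_version.split(".")[:2])  # "3.0.1" -> "3.0"
    return (
        f"com.amazon.deequ:deequ:"
        f"{deequ_version}_spark-{spark_minor}-scala-{scala_binary}"
    )


# Assumption: PyDeequ looks up SPARK_VERSION in the environment, so set it
# before importing pydeequ in the same Python process.
os.environ["SPARK_VERSION"] = SPARK_VERSION

coordinate = deequ_maven_coordinate(DEEQU_VERSION, SPARK_VERSION, SCALA_BINARY)
print(coordinate)  # coordinate to install as a cluster library
```

Setting the variable in the cluster configuration, as described above, has the same effect and applies to every notebook attached to the cluster. The jar itself still needs to be installed as a cluster library, because the Spark session already exists when a Databricks notebook starts.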

Thanks

dlakkad, May 18 '22 13:05