
Add support for Spark 3.2

Open alexott opened this issue 3 years ago • 13 comments

As part of "[SPARK-35558] Optimizes for multi-quantile retrieval", Spark 3.2 changed the signature of the ApproximatePercentile.getPercentiles function, and this broke Deequ:

NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile$PercentileDigest.getPercentiles([D)[D
	at com.amazon.deequ.analyzers.ApproxQuantile.fromAggregationResult(ApproxQuantile.scala:84)
	at com.amazon.deequ.analyzers.ScanShareableAnalyzer.metricFromAggregationResult(Analyzer.scala:192)
	at com.amazon.deequ.analyzers.ScanShareableAnalyzer.metricFromAggregationResult$(Analyzer.scala:185)
	at com.amazon.deequ.analyzers.ApproxQuantile.metricFromAggregationResult(ApproxQuantile.scala:50)
	at com.amazon.deequ.analyzers.runners.AnalysisRunner$.successOrFailureMetricFrom(AnalysisRunner.scala:362)
	at com.amazon.deequ.analyzers.runners.AnalysisRunner$.$anonfun$runScanningAnalyzers$5(AnalysisRunner.scala:330)
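
The descriptor in the trace, `([D)[D`, is the old signature: a `double[]` argument and a `double[]` result. The JVM links calls by the full descriptor, return type included, so a jar compiled against the old signature fails at runtime even if only the return type changed. Recompiling against the new Spark version is the usual fix. The sketch below only illustrates the mechanism: `OldDigest` and `NewDigest` are hypothetical stand-ins (not Spark classes), and the `Seq[Double]` return type of the new variant is an assumption. It looks the method up by name and parameter type only, then normalises the result.

```scala
// Hypothetical stand-ins for a class before and after a return-type change.
// These are NOT the Spark classes; the Seq[Double] return type is an assumption.
class OldDigest {
  def getPercentiles(ps: Array[Double]): Array[Double] = ps.map(_ * 100)
}

class NewDigest {
  def getPercentiles(ps: Array[Double]): Seq[Double] = ps.map(_ * 100).toSeq
}

object PercentileCompat {
  // Reflection matches on name and parameter types, not on the return type,
  // so the same lookup works against either variant.
  def percentiles(digest: AnyRef, ps: Array[Double]): Array[Double] = {
    val method = digest.getClass.getMethod("getPercentiles", classOf[Array[Double]])
    method.invoke(digest, ps) match {
      case a: Array[Double] => a
      case s: Seq[_]        => s.map(_.asInstanceOf[Double]).toArray
      case other            => throw new IllegalStateException(s"Unexpected result: $other")
    }
  }
}
```

Both `PercentileCompat.percentiles(new OldDigest, Array(0.5))` and the same call with `new NewDigest` return an array containing 50.0. Reflection costs speed and type safety, so this is only worth considering when a single artifact must run against several library versions.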

alexott avatar Aug 31 '21 12:08 alexott

Thanks for sharing the issue. My understanding is that Spark 3.2 is not yet released. We'll add support for Spark 3.2 when it is released.

lange-labs avatar Sep 02 '21 12:09 lange-labs

As an update, Spark 3.2 has been released - are there any plans to support this?

jpugliesi avatar Oct 29 '21 15:10 jpugliesi

Yes. We're working on the release.

TammoR avatar Oct 29 '21 19:10 TammoR

Yes. We're working on the release.

Hi TammoR, do you have a timeline for when this support will be released? If there is a specific branch that already supports 3.2, please provide a link to it.

sdandey avatar Nov 19 '21 21:11 sdandey

Hi sdandey. We're working on it on this branch: https://github.com/awslabs/deequ/tree/tammruka/2.0.0-spark-3.2.0

We have limited bandwidth at the moment. If you're able to contribute to this branch towards supporting Spark 3.2, you would be most welcome to.

TammoR avatar Nov 22 '21 11:11 TammoR

Hi @TammoR

Correlation analyzer failing in tammruka/2.0.0-spark-3.2.0 #399

deenkar avatar Dec 01 '21 06:12 deenkar

Is there any update on this? I just checked out the branch and hit this issue:

[ERROR] ## Exception when compiling 107 sources to /home/joesan/Projects/Private/scala-projects/deequ/target/classes
java.io.IOException: Cannot run program "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/javac" (in directory "/home/joesan/Projects/Private/scala-projects/deequ"): error=2, No such file or directory

joesan avatar Dec 19 '21 17:12 joesan

I managed to fix a few of the errors. After updating pom.xml to Scala 2.13 and Spark 3.2, I am left with the following:

[INFO] compiling 106 Scala sources and 1 Java source to /home/joesan/Projects/Private/scala-projects/deequ/target/classes ...
[ERROR] /home/joesan/Projects/Private/scala-projects/deequ/src/main/scala/com/amazon/deequ/analyzers/QuantileNonSample.scala:55: missing argument list for method to in trait IterableOnceOps
Unapplied methods are only converted to functions when a function type is expected.
You can make this conversion explicit by writing `to _` or `to(_)` instead of `to`.
[ERROR] one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  11.448 s
[INFO] Finished at: 2021-12-19T21:47:10+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:4.5.6:compile (scala-compile-first) on project deequ: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:4.5.6:compile failed: Compilation failed: InterfaceCompileFailed -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
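
The "missing argument list for method to" error is typical of code written for Scala 2.12 collections being compiled with 2.13, where `to` takes a factory as a value argument instead of a type argument. Spark 3.2 is published for both Scala 2.12 and 2.13, so staying on 2.12 is one way to avoid it. The sketch below does not reproduce the actual code in QuantileNonSample.scala (not shown in this thread); `ToConversion` and `buffer` are illustrative names.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative object; names are hypothetical, not taken from Deequ.
object ToConversion {
  val buffer: ArrayBuffer[Long] = ArrayBuffer(3L, 1L, 2L)

  // Scala 2.12 style, fails on 2.13 with "missing argument list for method to":
  //   buffer.to[List]
  // Scala 2.13 style, does not compile on plain 2.12:
  //   buffer.to(List)

  // The dedicated conversion methods compile on both versions.
  val asList: List[Long] = buffer.toList
  val asArray: Array[Long] = buffer.toArray
}
```

If the code has to cross-build and needs the generic `to(...)` form, the scala-collection-compat library backports it to 2.12 through `import scala.collection.compat._`.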

joesan avatar Dec 19 '21 20:12 joesan

Hi @TammoR and @joesan, just FYI: Spark 3.2.1 has been released.

With Spark 3.2.1 and the tests skipped, the branch compiles and produces an artifact.

First, change the Spark version from 3.2.0 to 3.2.1 on this line: https://github.com/awslabs/deequ/blob/tammruka/2.0.0-spark-3.2.0/pom.xml#L21

After that, the scalastyle check makes the build fail:

error file=/Users/JP28431/deequ/src/main/scala/com/amazon/deequ/analyzers/catalyst/StatefulApproxQuantile.scala message=File line length exceeds 100 characters line=34
error file=/Users/JP28431/deequ/src/main/scala/com/amazon/deequ/analyzers/catalyst/StatefulApproxQuantile.scala message=File line length exceeds 100 characters line=119

To fix (or bypass) this, you can either format the code with scalafmt or set the following scalastyle plugin settings to false: https://github.com/awslabs/deequ/blob/tammruka/2.0.0-spark-3.2.0/pom.xml#L215-L216

Then run mvn clean install -DskipTests, and the build finishes successfully. I had to skip the tests because some of them are failing. I think that fixing those tests would resolve this issue.

[Screenshot, 2022-02-02]

tanvn avatar Feb 01 '22 15:02 tanvn

Hi @TammoR, I just created a PR for this issue: https://github.com/awslabs/deequ/pull/416

Could you please take a look? It seems I do not have permission to set Reviewers and Assignees, so I would appreciate it if you could take care of that part too 🙇

tanvn avatar Feb 11 '22 11:02 tanvn

Hi @TammoR, thank you for merging the PR! May I ask whether there are any remaining blockers for Spark 3.2 support? I would be very grateful if you could share the current status of this issue 🙇

tanvn avatar Feb 14 '22 14:02 tanvn

Hi @tanvn, thanks for the great work on this issue! We merged the code into master.

A new Deequ version 2.0.1-spark-3.2 is now available.

lange-labs avatar Feb 16 '22 07:02 lange-labs

Hi @lange-labs @TammoR, it looks like this ticket is still open, and we're now on Spark 3.3.0. Is that an oversight? Is it OK to open a compatibility request for 3.3.0?

AlecVivian avatar Dec 01 '22 21:12 AlecVivian