jgit-spark-connector
Engine fails to extract UASTs on actual Spark cluster
When running in local mode with --packages "tech.sourced:engine:0.6.3", extracting UASTs works.
But after switching to an actual Apache Spark cluster with the same params and query, i.e. in Standalone mode, extractUAST fails with
java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createUnstarted()Lcom/google/common/base/Stopwatch;
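Not part of the original report, but a quick way to see which Guava jar the driver and each executor actually load in such a setup (a hypothetical diagnostic, run from spark-shell):

// print the jar that provides Guava's Stopwatch on the driver
println(classOf[com.google.common.base.Stopwatch]
  .getProtectionDomain.getCodeSource.getLocation)

// and the distinct jars the executors load it from
spark.sparkContext.parallelize(1 to 4, 4)
  .map(_ => classOf[com.google.common.base.Stopwatch]
    .getProtectionDomain.getCodeSource.getLocation.toString)
  .distinct()
  .collect()
  .foreach(println)

If the locations differ, or point at a Spark/Hadoop jar rather than the engine package, a classpath conflict like the one below is the likely cause.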
Steps to Reproduce
- Start Apache Spark in cluster mode and a spark-shell with Engine:
export MASTER_HOST=127.0.0.1
$SPARK_HOME/sbin/start-master.sh -h $MASTER_HOST -p 7077
$SPARK_HOME/sbin/start-slave.sh $MASTER_HOST:7077
$SPARK_HOME/bin/spark-shell --master "spark://$MASTER_HOST:7077" --packages "tech.sourced:engine:0.6.3"
- Run extractUASTs:
import tech.sourced.engine._
val path = "<path-to-siva-files>"
val engine = Engine(spark, path, "siva")
val repos = engine.getRepositories
val files = repos.getHEAD
  .getCommits
  .getTreeEntries
  .getBlobs
val uast = files.extractUASTs
uast.count
Expected Behavior
Get the number of UASTs.
Current Behavior
java.lang.NoSuchMethodError
18/06/18 10:39:00 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 192.168.1.37, executor 0): java.lang.NoSuchMethodError: com.google.common.base.Stopwatch.createUnstarted()Lcom/google/common/base/Stopwatch;
at io.grpc.internal.GrpcUtil$4.get(GrpcUtil.java:566)
at io.grpc.internal.GrpcUtil$4.get(GrpcUtil.java:563)
at io.grpc.internal.CensusStatsModule$ClientCallTracer.<init>(CensusStatsModule.java:333)
at io.grpc.internal.CensusStatsModule.newClientCallTracer(CensusStatsModule.java:137)
at io.grpc.internal.CensusStatsModule$StatsClientInterceptor.interceptCall(CensusStatsModule.java:672)
at io.grpc.ClientInterceptors$InterceptorChannel.newCall(ClientInterceptors.java:104)
at io.grpc.internal.ManagedChannelImpl.newCall(ManagedChannelImpl.java:636)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$ProtocolServiceBlockingStub.parse(ProtocolServiceGrpc.scala:61)
at org.bblfsh.client.BblfshClient.parse(BblfshClient.scala:30)
at tech.sourced.engine.util.Bblfsh$.extractUAST(Bblfsh.scala:80)
at tech.sourced.engine.udf.ExtractUASTsUDF$class.extractUASTs(ExtractUASTsUDF.scala:17)
at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUASTs(ExtractUASTsUDF.scala:24)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:395)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:377)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
Context
This mimics the file-duplication workflow we have in Gemini on hash. The ability to reproduce it in spark-shell is crucial for debugging.
Possible Solution
- Update the build so that the final fat JAR avoids calling the un-shaded version of Guava (see the sketch after this list)
- Add a TravisCI profile that runs this query on an actual local Apache Spark Standalone cluster
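A minimal sketch of the first option, assuming the fat JAR is built with sbt-assembly (the actual engine build may differ; the relocation target package is illustrative):

assemblyShadeRules in assembly := Seq(
  // relocate Guava so grpc bundled in the fat JAR cannot collide with the
  // older Guava that Spark/Hadoop put on the classpath
  ShadeRule.rename("com.google.common.**" -> "tech.sourced.engine.shaded.guava.@1").inAll,
  // likewise for protobuf, which surfaces in the second stack trace below
  ShadeRule.rename("com.google.protobuf.**" -> "tech.sourced.engine.shaded.protobuf.@1").inAll
)

With such rules in place, the bytecode of the bundled grpc classes is rewritten to reference the relocated Guava, so whichever Guava version the cluster provides no longer matters.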
Your Environment (for bugs)
- Spark version: 2.2.0
- Engine version: 0.6.3
- Operating System and version: tested on Linux and macOS
Update: this seems to be related to how --packages works in Apache Spark 😕
If spark-shell is started with:
--packages "tech.sourced:engine:0.6.3"->java.lang.NoSuchMethodError--jars <path-to-engine>/target/engine-0.6.3.jar-> works as expected
--packages used to work for me. This seems to be the usual dependency hell with conflicting Guava versions.
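One way to confirm the clash from spark-shell (an illustrative check, not from the original thread): Stopwatch.createUnstarted() only exists since Guava 15, so the lookup fails wherever an older Guava wins on the classpath:

import scala.util.Try
// true on Guava 15+; false when an older Guava shadows the one grpc needs
val hasCreateUnstarted = Try(
  classOf[com.google.common.base.Stopwatch].getMethod("createUnstarted")
).isSuccess
println(s"Stopwatch.createUnstarted available: $hasCreateUnstarted")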
@bzz maybe it can be related to some kind of cache used by the --packages command?
Maybe deleting the .ivy2 folder solves the problem.
True. Like @smola, --packages used to work for me as well.
@ajnavarro Hmmm... But it did not work either on my local machine or from a new pod on the staging pipeline cluster. Or do you mean some Spark Master-side cache?
A quick verification on a new pod with an empty cache and a local standalone cluster:
kubectl run -i --tty spark-new --image=srcd/spark:2.2.0_v2 --generator="run-pod/v1" --command -- /bin/bash
whoami
ls -la /root
export SPARK_HOME="/opt/spark-2.2.0-bin-hadoop2.7"
export MASTER_HOST=127.0.0.1
$SPARK_HOME/sbin/start-master.sh -h $MASTER_HOST -p 7077
$SPARK_HOME/sbin/start-slave.sh $MASTER_HOST:7077
$SPARK_HOME/bin/spark-shell --master "spark://$MASTER_HOST:7077" --packages "tech.sourced:engine:0.6.4"
and then
import tech.sourced.engine._
val path = "hdfs://hdfs-namenode/pga/siva/latest/ff/"
val engine = Engine(spark, path, "siva")
val repos = engine.getRepositories
val files = repos.getHEAD
.getCommits
.getTreeEntries
.getBlobs
val uast = files.extractUASTs
uast.count
results in
18/06/18 12:07:43 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 2, 10.2.15.90, executor 0): java.lang.NoSuchMethodError: com.google.protobuf.Descriptors$Descriptor.getOneofs()Ljava/util/List;
at com.google.protobuf.GeneratedMessageV3$FieldAccessorTable.<init>(GeneratedMessageV3.java:1727)
at com.google.protobuf.DurationProto.<clinit>(DurationProto.java:52)
at com.google.protobuf.duration.DurationProto$.javaDescriptor$lzycompute(DurationProto.scala:26)
at com.google.protobuf.duration.DurationProto$.javaDescriptor(DurationProto.scala:25)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.GeneratedProto$.javaDescriptor$lzycompute(GeneratedProto.scala:63)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.GeneratedProto$.javaDescriptor(GeneratedProto.scala:59)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$.<init>(ProtocolServiceGrpc.scala:30)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$.<clinit>(ProtocolServiceGrpc.scala)
at org.bblfsh.client.BblfshClient.<init>(BblfshClient.scala:20)
@bzz it's a cache on the master (or workers) side. I used to have the same problem. Removing the cache helped. Reference: https://github.com/src-d/engine/issues/389
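For anyone debugging similar --packages issues, spark-shell can also list which jars the driver registered for shipping to executors (an illustrative check, not from the original thread):

// jars that --packages / --jars registered for distribution to executors;
// a stale entry here would point at a cached, outdated artifact
spark.sparkContext.listJars().foreach(println)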