
[BUG] Exception while shutting down plugin com.nvidia.spark.SQLPlugin on Spark 3.1.1

Open · GaryShen2008 opened this issue 3 years ago · 1 comment

Describe the bug
I got an exception while shutting down plugin com.nvidia.spark.SQLPlugin when I ran an NDS 2.0 query in local mode on Spark 3.1.1. It does not happen on Spark 3.2.1. The exception is below.

```
ai.rapids.cudf.RmmException: Could not shut down RMM there appear to be outstanding allocations
	at ai.rapids.cudf.Rmm.shutdown(Rmm.java:219)
	at ai.rapids.cudf.Rmm.shutdown(Rmm.java:179)
	at com.nvidia.spark.rapids.GpuDeviceManager$.shutdown(GpuDeviceManager.scala:146)
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.shutdown(Plugin.scala:330)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$shutdown$4(PluginContainer.scala:144)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$shutdown$4$adapted(PluginContainer.scala:141)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.shutdown(PluginContainer.scala:141)
	at org.apache.spark.executor.Executor.$anonfun$stop$4(Executor.scala:332)
	at org.apache.spark.executor.Executor.$anonfun$stop$4$adapted(Executor.scala:332)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:332)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:222)
	at org.apache.spark.executor.Executor.stop(Executor.scala:332)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receiveAndReply$1.applyOrElse(LocalSchedulerBackend.scala:83)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

Steps/Code to reproduce bug
Run an NDS 2.0 single query.
Plugin: rapids-4-spark_2.12-22.08.0-20220803.095308-37-cuda11.jar
Spark: 3.1.1

The power_run_gpu.template:

```shell
export SPARK_CONF=("--master" "local[*]"
                   "--conf" "spark.rapids.memory.gpu.pool=ARENA"
                   "--conf" "spark.driver.maxResultSize=2GB"
                   "--conf" "spark.executor.cores=12"
                   "--conf" "spark.rapids.sql.concurrentGpuTasks=2"
                   "--conf" "spark.executor.memory=16G"
                   "--conf" "spark.driver.memory=5G"
                   "--conf" "spark.rapids.memory.gpu.allocFraction=1"
                   "--conf" "spark.rapids.memory.gpu.minAllocFraction=0.005"
                   "--conf" "spark.rapids.memory.gpu.maxAllocFraction=1"
                   "--conf" "spark.sql.files.maxPartitionBytes=2gb"
                   "--conf" "spark.rapids.memory.host.spillStorageSize=2G"
                   "--conf" "spark.sql.adaptive.enabled=true"
                   "--conf" "spark.plugins=com.nvidia.spark.SQLPlugin"
                   "--conf" "spark.rapids.memory.pinnedPool.size=2g"
                   "--jars" "$NDS_LISTENER_JAR")
```

Expected behavior
No exception.

GaryShen2008 avatar Aug 03 '22 15:08 GaryShen2008

This might be similar to #6001.

This exception is caused by us trying to shut down RMM while some memory is still outstanding. This could be a memory leak, or it could be that broadcast join data was deliberately kept alive and simply took too long to be released.

It would be interesting to see if this can be reproduced, but my guess is that it is the latter case, so it happens randomly and rarely.
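One way to narrow down which of the two causes it is would be to enable the plugin's GPU memory debug logging and rerun the query. This is a hedged sketch, not part of the original report: it assumes the `spark.rapids.memory.gpu.debug` setting (values `NONE`, `STDOUT`, `STDERR`) is supported by this plugin build, and that `SPARK_CONF` is the array defined in power_run_gpu.template above.

```shell
# Sketch: append RMM allocation logging to the existing SPARK_CONF array so
# any buffers still allocated at shutdown are reported in the executor log.
# Assumes spark.rapids.memory.gpu.debug is available in this plugin version.
SPARK_CONF+=("--conf" "spark.rapids.memory.gpu.debug=STDOUT")

# Show the resulting configuration for a quick sanity check.
printf '%s\n' "${SPARK_CONF[@]}"
```

If the outstanding allocations logged at shutdown correspond to broadcast buffers, that would support the theory that the data was leaked on purpose and simply released too late.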

revans2 avatar Aug 03 '22 18:08 revans2