cdap icon indicating copy to clipboard operation
cdap copied to clipboard

Add logging of Spark execution plan to facilitate debugging

Open DJSagarAhire opened this issue 2 years ago • 3 comments

This PR adds logging for the execution plan for Spark pipelines. The logging can be seen in the full raw logs.

Example log:

2022-08-11 20:59:59,771 - DEBUG [spark-submitter-phase-1-725c5b7c-198a-11ed-8838-000000bbd65a:i.c.c.e.s.b.RDDUtils@82] - (200) MapPartitionsRDD[49] at flatMapToPair at BaseRDDCollection.java:237 []
  |   MapPartitionsRDD[48] at flatMap at BaseRDDCollection.java:162 []
  |   MapPartitionsRDD[47] at map at BaseRDDCollection.java:157 []
  |   MapPartitionsRDD[46] at map at RDDCollection.java:289 []
  |   MapPartitionsRDD[45] at map at RDDCollection.java:288 []
  |   MapPartitionsRDD[44] at javaRDD at RDDCollection.java:287 []
  |   MapPartitionsRDD[43] at javaRDD at RDDCollection.java:287 []
  |   MapPartitionsRDD[42] at javaRDD at RDDCollection.java:287 []
  |   ZippedPartitionsRDD2[41] at javaRDD at RDDCollection.java:287 []
  |   MapPartitionsRDD[35] at javaRDD at RDDCollection.java:287 []
  |   ShuffledRowRDD[34] at javaRDD at RDDCollection.java:287 []
  +-(200) MapPartitionsRDD[33] at javaRDD at RDDCollection.java:287 []
      |   MapPartitionsRDD[32] at javaRDD at RDDCollection.java:287 []
      |   MapPartitionsRDD[28] at createDataFrame at RDDCollection.java:397 []
      |   MapPartitionsRDD[27] at map at RDDCollection.java:396 []
      |   MapPartitionsRDD[26] at map at RDDCollection.java:113 []
      |   MapPartitionsRDD[25] at flatMap at BaseRDDCollection.java:162 []
      |   MapPartitionsRDD[24] at toJavaRDD at RDDCollection.java:477 []
      |   MapPartitionsRDD[23] at toJavaRDD at RDDCollection.java:477 []
      |   MapPartitionsRDD[22] at toJavaRDD at RDDCollection.java:477 []
      |   MapPartitionsRDD[21] at toJavaRDD at RDDCollection.java:477 []
      |   MapPartitionsRDD[20] at toJavaRDD at RDDCollection.java:477 []
      |   ShuffledRowRDD[19] at toJavaRDD at RDDCollection.java:477 []
      +-(20) MapPartitionsRDD[16] at toJavaRDD at RDDCollection.java:477 []
         |   MapPartitionsRDD[15] at toJavaRDD at RDDCollection.java:477 []
         |   MapPartitionsRDD[14] at toJavaRDD at RDDCollection.java:477 []
         |   MapPartitionsRDD[13] at toJavaRDD at RDDCollection.java:477 []
         |   MapPartitionsRDD[12] at toJavaRDD at RDDCollection.java:477 []
         |   MapPartitionsRDD[11] at toJavaRDD at RDDCollection.java:477 []
         |   MapPartitionsRDD[10] at toJavaRDD at RDDCollection.java:477 []
         |   UnionRDD[9] at union at BaseRDDCollection.java:137 []
         |   MapPartitionsRDD[2] at flatMap at BaseRDDCollection.java:162 []
         |   MapPartitionsRDD[1] at flatMap at BatchSparkPipelineDriver.java:128 []
         |   DatasetRDD[0] at RDD at DatasetRDD.scala:52 []
         |   MapPartitionsRDD[8] at flatMap at BaseRDDCollection.java:162 []
         |   MapPartitionsRDD[7] at flatMap at BatchSparkPipelineDriver.java:128 []
         |   DatasetRDD[6] at RDD at DatasetRDD.scala:52 []
  |   MapPartitionsRDD[40] at javaRDD at RDDCollection.java:287 []
  |   ShuffledRowRDD[39] at javaRDD at RDDCollection.java:287 []
  +-(10) MapPartitionsRDD[37] at javaRDD at RDDCollection.java:287 []
     |   MapPartitionsRDD[36] at javaRDD at RDDCollection.java:287 []
     |   MapPartitionsRDD[31] at createDataFrame at RDDCollection.java:397 []
     |   MapPartitionsRDD[30] at map at RDDCollection.java:396 []
     |   MapPartitionsRDD[29] at map at RDDCollection.java:144 []
     |   MapPartitionsRDD[5] at flatMap at BaseRDDCollection.java:162 []
     |   MapPartitionsRDD[4] at flatMap at BatchSparkPipelineDriver.java:128 []
     |   DatasetRDD[3] at RDD at DatasetRDD.scala:52 []

DJSagarAhire avatar Aug 11 '22 15:08 DJSagarAhire

gitpod-io[bot] avatar Aug 11 '22 15:08 gitpod-io[bot]

Probably call setCallSite in the for loop of io.cdap.cdap.etl.spark.SparkPipelineRunner#runPipeline and clear in the end

tivv avatar Aug 11 '22 16:08 tivv

I am wondering if we can also put stage names into it by calling org.apache.spark.SparkContext#setCallSite. If would be 100x more valuable

Suggested changes have been added. Logs are now of the format:

2022-08-16 17:27:21,048 - DEBUG [spark-submitter-phase-1-90062f64-1d5a-11ed-82d6-00000035c3e1:i.c.c.e.s.b.RDDUtils@83] - (200) MapPartitionsRDD[49] at  []
  |   MapPartitionsRDD[48] at Joiner (Plugin Joiner of Type batchjoiner) []

DJSagarAhire avatar Aug 16 '22 12:08 DJSagarAhire