sparklens
Add support for Sparklens on Spark 3.0.0 and later Spark versions with Scala 2.12
The community made many changes in Spark 3.0.0, so this improvement PR updates Sparklens to work with Spark 3.0.0 (and newer Spark versions) and Scala 2.12.
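For reference, the kind of build change involved is sketched below (illustrative only; the exact versions and settings are in the PR diff):

```scala
// Illustrative sketch only -- the authoritative changes are in the PR diff.
// Bumping the build to Scala 2.12 and Spark 3.0.0 in build.sbt:
scalaVersion := "2.12.10"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.0" % "provided"
```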
I have made the code changes and tested them; the spark-shell invocation and the resulting Sparklens report are below.
./bin/spark-shell --jars file:///tmp//src/opensrc/sparklens/target/scala-2.12/sparklens_2.12-0.4.0.jar --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener --conf spark.eventLog.enabled=true
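The listener can also be attached programmatically instead of via --conf; a minimal sketch (not from this PR, and assuming the sparklens jar is already on the application classpath):

```scala
// Minimal sketch: attach the Sparklens listener from application code instead
// of passing --conf on the command line. Assumes the sparklens jar is on the
// application classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sparklens-demo")
  .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
  .config("spark.eventLog.enabled", "true")
  .getOrCreate()
```

The report below was produced with the spark-shell run above.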
Printing application metrics. These metrics are collected at task-level granularity and aggregated across the app (all tasks, stages, and jobs).
AggregateMetrics (Application Metrics) total measurements 20
NAME SUM MIN MAX MEAN
diskBytesSpilled 0.0 KB 0.0 KB 0.0 KB 0.0 KB
executorRuntime 344.0 ms 3.0 ms 30.0 ms 17.0 ms
inputBytesRead 0.0 KB 0.0 KB 0.0 KB 0.0 KB
jvmGCTime 1.3 ss 0.0 ms 130.0 ms 65.0 ms
memoryBytesSpilled 0.0 KB 0.0 KB 0.0 KB 0.0 KB
outputBytesWritten 0.0 KB 0.0 KB 0.0 KB 0.0 KB
peakExecutionMemory 0.0 KB 0.0 KB 0.0 KB 0.0 KB
resultSize 17.5 KB 0.9 KB 0.9 KB 0.9 KB
shuffleReadBytesRead 0.0 KB 0.0 KB 0.0 KB 0.0 KB
shuffleReadFetchWaitTime 0.0 ms 0.0 ms 0.0 ms 0.0 ms
shuffleReadLocalBlocks 0 0 0 0
shuffleReadRecordsRead 0 0 0 0
shuffleReadRemoteBlocks 0 0 0 0
shuffleWriteBytesWritten 0.0 KB 0.0 KB 0.0 KB 0.0 KB
shuffleWriteRecordsWritten 0 0 0 0
shuffleWriteTime 0.0 ms 0.0 ms 0.0 ms 0.0 ms
taskDuration 4.3 ss 10.0 ms 441.0 ms 215.0 ms
Total Hosts 1, and the maximum concurrent hosts = 1
Host 192.168.1.19 startTime 09:32:50:262 executors count 1
Done printing host timeline
======================
Printing executors timeline....
Total Executors 1, and maximum concurrent executors = 1
At 09:32 executors added 1 & removed 0 currently available 1
Done printing executors timeline...
============================
Printing Application timeline
09:32:49:492 app started
09:32:57:059 JOB 0 started : duration 00m 00s
[ 0 |||||||||||||||||||||||||||||||||||||||||| ]
09:32:57:448 Stage 0 started : duration 00m 00s
09:32:57:889 Stage 0 ended : maxTaskTime 30 taskCount 10
09:32:57:899 JOB 0 ended
09:33:01:125 JOB 1 started : duration 00m 00s
[ 1 ||||||||||||||||| ]
09:33:01:132 Stage 1 started : duration 00m 00s
09:33:01:149 Stage 1 ended : maxTaskTime 6 taskCount 10
09:33:01:150 JOB 1 ended
09:33:03:044 app ended
Checking for job overlap...
JobGroup 1 SQLExecID (-1)
Number of Jobs 1 JobIDs(0)
Timing [09:32:57:059 - 09:32:57:899]
Duration 00m 00s
JOB 0 Start 09:32:57:059 End 09:32:57:899
JobGroup 2 SQLExecID (-1)
Number of Jobs 1 JobIDs(1)
Timing [09:33:01:125 - 09:33:01:150]
Duration 00m 00s
JOB 1 Start 09:33:01:125 End 09:33:01:150
No overlapping jobgroups found. Good
Time spent in Driver vs Executors
Driver WallClock Time 00m 12s 93.62%
Executor WallClock Time 00m 00s 6.38%
Total WallClock Time 00m 13s
Minimum possible time for the app based on the critical path (with infinite resources) 00m 12s
Minimum possible time for the app with same executors, perfect parallelism and zero skew 00m 12s
If we were to run this app with single executor and single core 00h 00m
Total cores available to the app 16
OneCoreComputeHours: Measure of total compute power available from the cluster. One core in an executor, running for one hour, counts as one OneCoreComputeHour. An executor with 4 cores has 4 times the OneCoreComputeHours of a one-core executor; similarly, a one-core executor running for 4 hours has OneCoreComputeHours equal to a 4-core executor running for 1 hour.
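A quick worked example of that equivalence (illustrative numbers, not part of the report):

```scala
// Illustrative only: OneCoreComputeHours is cores multiplied by hours of runtime,
// so a 1-core executor running 4 hours equals a 4-core executor running 1 hour.
val oneCoreFourHours = 1 * 4.0  // = 4.0 OneCoreComputeHours
val fourCoresOneHour = 4 * 1.0  // = 4.0 OneCoreComputeHours
assert(oneCoreFourHours == fourCoresOneHour)
```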
Driver Utilization (Cluster idle because of driver)
Total OneCoreComputeHours available 00h 03m
Total OneCoreComputeHours available (AutoScale Aware) 00h 03m
OneCoreComputeHours wasted by driver 00h 03m
AutoScale Aware: Most of the calculations by this tool assume that all executors are available throughout the runtime of the application. The number above is printed as a caution to be taken when interpreting the efficiency metrics.
Cluster Utilization (Executors idle because of lack of tasks or skew)
Executor OneCoreComputeHours available 00h 00m
Executor OneCoreComputeHours used 00h 00m 2.49%
OneCoreComputeHours wasted 00h 00m 97.51%
App Level Wastage Metrics (Driver + Executor)
OneCoreComputeHours wasted Driver 93.62%
OneCoreComputeHours wasted Executor 6.22%
OneCoreComputeHours wasted Total 99.84%
App completion time and cluster utilization estimates with different executor counts
Real App Duration 00m 13s
Model Estimation 00m 12s
Model Error 6%
NOTE: 1) Model error could be large when auto-scaling is enabled.
2) The model doesn't handle multiple jobs run via a thread pool. For better insights into application scalability, please try such jobs one by one without the thread pool.
Executor count 1 (100%) estimated time 00m 12s and estimated cluster utilization 0.17%
Executor count 1 (110%) estimated time 00m 12s and estimated cluster utilization 0.17%
Executor count 1 (120%) estimated time 00m 12s and estimated cluster utilization 0.17%
Executor count 1 (150%) estimated time 00m 12s and estimated cluster utilization 0.17%
Executor count 2 (200%) estimated time 00m 12s and estimated cluster utilization 0.08%
Executor count 3 (300%) estimated time 00m 12s and estimated cluster utilization 0.06%
Executor count 4 (400%) estimated time 00m 12s and estimated cluster utilization 0.04%
Executor count 5 (500%) estimated time 00m 12s and estimated cluster utilization 0.03%
Total tasks in all stages 20
Per Stage Utilization
Stage-ID Wall Task Task IO% Input Output ----Shuffle----- -WallClockTime- --OneCoreComputeHours--- MaxTaskMem
Clock% Runtime% Count Input | Output Measured | Ideal Available| Used%|Wasted%
0 96.00 84.30 10 NaN 0.0 KB 0.0 KB 0.0 KB 0.0 KB 00m 00s 00m 00s 00h 00m 4.1 95.9 0.0 KB
1 3.00 15.70 10 NaN 0.0 KB 0.0 KB 0.0 KB 0.0 KB 00m 00s 00m 00s 00h 00m 19.9 80.1 0.0 KB
Max memory which an executor could have taken = 0.0 KB
Stage-ID WallClock OneCore Task PRatio -----Task------ OIRatio |* ShuffleWrite% ReadFetch% GC% *|
Stage% ComputeHours Count Skew StageSkew
0 96.29 00h 00m 10 0.63 1.03 0.07 0.00 |* 0.00 0.00 448.28 *|
1 3.71 00h 00m 10 0.63 1.00 0.35 0.00 |* 0.00 0.00 0.00 *|
PRatio: Number of tasks in stage divided by number of cores. Represents degree of
parallelism in the stage
TaskSkew: Duration of largest task in stage divided by duration of median task.
Represents degree of skew in the stage
TaskStageSkew: Duration of largest task in stage divided by total duration of the stage.
Represents the impact of the largest task on stage time.
OIRatio: Output to input ratio. Total output of the stage (results + shuffle write)
divided by total input (input data + shuffle read)
These metrics below represent distribution of time within the stage
ShuffleWrite: Amount of time spent in shuffle writes across all tasks in the given
stage as a percentage
ReadFetch: Amount of time spent in shuffle read across all tasks in the given
stage as a percentage
GC: Amount of time spent in GC across all tasks in the given stage as a
percentage
If a stage contributes a large percentage of the overall application time, these metrics can help identify which part (shuffle write, read fetch, or GC) is responsible.
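To make these definitions concrete, here is a small sketch of how the ratios could be computed from task-level numbers (values and names are illustrative, not Sparklens internals):

```scala
// Illustrative sketch of the per-stage ratios defined above.
val taskDurationsMs = Seq(30.0, 22.0, 18.0, 17.0, 15.0, 14.0, 12.0, 10.0, 5.0, 3.0)
val coresAvailable  = 16
val stageDurationMs = 441.0   // wall-clock duration of the stage

val pRatio        = taskDurationsMs.size.toDouble / coresAvailable  // tasks / cores
val medianMs      = taskDurationsMs.sorted.apply(taskDurationsMs.size / 2)
val taskSkew      = taskDurationsMs.max / medianMs                  // largest / median task
val taskStageSkew = taskDurationsMs.max / stageDurationMs           // largest task / stage time

val stageInputBytes  = 0.0 + 0.0  // input data + shuffle read
val stageOutputBytes = 0.0 + 0.0  // results + shuffle write
val oiRatio = if (stageInputBytes == 0) 0.0 else stageOutputBytes / stageInputBytes
```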
cc @itskals @iamrohit @mayurdb
How did you do the sbt compile? I am facing this error on compilation: sbt.ResolveException: unresolved dependency: org.spark-packages#sbt-spark-package;0.2.4: not found
Update the resolver path to https://repos.spark-packages.org/
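Concretely, the fix looks roughly like this in the sbt plugin definitions (a sketch based on this thread; the file is commonly project/plugins.sbt, and the versions are the ones mentioned below):

```scala
// Sketch of the resolver fix described above (versions as used in this thread).
resolvers += "Spark Packages Repo" at "https://repos.spark-packages.org/"

addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
```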
Wow, that works! Thanks @tayarajat
@SaurabhChawla100 Should we have a separate branch for Spark 3.0 instead?
I have created a new branch for Spark 3.0. If you think it's better to have a separate branch, please raise this against https://github.com/qubole/sparklens/tree/SPARK30
@mayurdb - I have changed the merge branch to SPARK30. But I think it's better to have a branch-2.x and merge SPARK30 into master.
How did you run sbt compile? I have updated the resolver URL to the one mentioned above, but I am still getting an unresolved dependency error:
sbt.ResolveException: unresolved dependency: com.eed3si9n#sbt-assembly;0.12.0: not found [error] unresolved dependency: org.spark-packages#sbt-spark-package;0.2.4: not found
my project/plugin.sbt is
`addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
resolvers += "Spark Package Main Repo" at "https://repos.spark-packages.org/"
addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.4")`
Any update on this? This would be great to have