Support for Spark 2.3/2.4 in Dr.Elephant
Currently, Dr.Elephant supports Spark only up to 2.2.3. We need to support the more recent versions of Spark (at least 2.3 and 2.4). This requires several changes; I will update the issue as the work proceeds.
@ShubhamGupta29 Just FYI: I managed to run https://github.com/songgane/dr-elephant/tree/feature/support_spark_2.x and it worked with Spark 2.3+. There are a couple of tests that need to be fixed (I skipped them).
I have two questions:
- It doesn't show executor memory used.
- Same for GC statistics.

Is it that these metrics are not available in SHS 2.3, or is some work still needed to surface them?
Anyway, glad to see progress for 2.3+ Spark.
@mareksimunek I will surely go through your changes. Just some questions:
- Which Spark version did you use for compilation (was that change done in compile.conf)?
- Are you fetching data from the event logs or from SHS's REST API?
For your Heuristics-related issues, I need to check how you are retrieving and transforming the data.
- I used Hadoop 2.3.0 and Spark 2.1.2: https://github.com/songgane/dr-elephant/blob/feature/support_spark_2.x/compile.conf (see the compile.conf sketch after this list). I tried rebasing onto current master, but with higher versions more tests failed, so I stuck with 2.1 and had to skip fewer tests :).
- I am fetching with:

```xml
<fetcher>
  <applicationtype>spark</applicationtype>
  <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
  <params>
    <use_rest_for_eventlogs>true</use_rest_for_eventlogs>
    <should_process_logs_locally>true</should_process_logs_locally>
    <event_log_location_uri>/spark2-history/</event_log_location_uri>
    <spark_log_ext>.snappy</spark_log_ext>
  </params>
</fetcher>
```
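For context on the compile.conf mention above, a minimal sketch of the relevant lines follows; the property names are Dr.Elephant's stock compile.conf keys, and the values mirror the versions quoted, so treat it as illustrative rather than a verbatim copy of the branch:

```
# compile.conf (sketch): build-time version pins read by compile.sh
hadoop_version=2.3.0
spark_version=2.1.2
```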
@mareksimunek Regarding the issue you mentioned (executor memory used not showing): is it for every job, or is the value available for some jobs?
@ShubhamGupta29 Every Spark job. I suspect it's because of this: https://github.com/songgane/dr-elephant/blame/feature/support_spark_2.x/app/org/apache/spark/deploy/history/SparkDataCollection.scala#L178
That info.memUsed is only available while the job is running, and I am not sure Dr.Elephant is fetching this information; once the job has ended, the information is gone. When I check the SHS of a completed job, it shows Peak memory: 0 everywhere.
Spark history version: 2.3.0.2.6.5.0-292 (it's 2.3 with some HDP patches).
@mareksimunek Hello, buddy, I hit the same case as yours: mem/executor/storage info cannot be fetched from SHS once the job ends. Has your problem been solved? Thanks.
@mareksimunek @xglv1985 I need one piece of help from you guys in debugging the issue: can you confirm whether the value of memUsed is non-zero in the response from the REST API endpoint /executors?
@ShubhamGupta29 Yep, it's zero. Checked http://someHost:18081/api/v1/applications/application_1587409317223_1104/1/executors:
```json
[ {
  "id" : "driver",
  "hostPort" : "someHost:37121",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 0,
  "maxTasks" : 0,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalGCTime" : 0,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 407057203,
  "addTime" : "2020-04-25T21:08:51.911GMT",
  "executorLogs" : {
    "stdout" : "http://someHost:8042/node/containerlogs/container_e54_1587409317223_1104_01_000001/fulltext/stdout?start=-4096",
    "stderr" : "http://someHost:8042/node/containerlogs/container_e54_1587409317223_1104_01_000001/fulltext/stderr?start=-4096"
  },
  "memoryMetrics" : {
    "usedOnHeapStorageMemory" : 0,
    "usedOffHeapStorageMemory" : 0,
    "totalOnHeapStorageMemory" : 407057203,
    "totalOffHeapStorageMemory" : 0
  }
}, {
  "id" : "9",
  "hostPort" : "someHost2.dev.dszn.cz:33108",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 3,
  "maxTasks" : 3,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 56,
  "totalTasks" : 56,
  "totalDuration" : 846816,
  "totalGCTime" : 31893,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 661719258,
  "totalShuffleWrite" : 747129542,
  "isBlacklisted" : false,
  "maxMemory" : 3032481792,
  "addTime" : "2020-04-25T21:09:08.100GMT",
  "executorLogs" : {
    "stdout" : "http://someHost2.dev.dszn.cz:8042/node/containerlogs/container_e54_1587409317223_1104_01_000011/fulltext/stdout?start=-4096",
    "stderr" : "http://someHost2.dev.dszn.cz:8042/node/containerlogs/container_e54_1587409317223_1104_01_000011/fulltext/stderr?start=-4096"
  },
  "memoryMetrics" : {
    "usedOnHeapStorageMemory" : 0,
    "usedOffHeapStorageMemory" : 0,
    "totalOnHeapStorageMemory" : 3032481792,
    "totalOffHeapStorageMemory" : 0
  }
}.....
```
Correction: it's even reporting zero memoryUsed for a running job through the SHS REST API. Should I set something in spark.executor.extraJavaOptions to make it show these stats?
For MR it's getting the memory stats from this setting, am I right?

```
mapreduce.task.profile.params=-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s
```
@ShubhamGupta29 Yes, I can confirm it as well: the mem field is 0 in the response JSON.
@mareksimunek @xglv1985 Thanks for the prompt response. I am able to support Spark 2.3 and will make the changes public soon. I am still debugging this memUsed = 0 issue, as the problem persists with Spark 2.3, and will stay in contact with you guys.
One more query: can you paste here the values you are getting for these metrics in the /executors API response?
Metrics: "memoryMetrics" : { "usedOnHeapStorageMemory", "usedOffHeapStorageMemory", "totalOnHeapStorageMemory", "totalOffHeapStorageMemory" }
sure:
"memoryMetrics" : { "usedOnHeapStorageMemory" : 0, "usedOffHeapStorageMemory" : 0, "totalOnHeapStorageMemory" : 1099746508, "totalOffHeapStorageMemory" : 4000000000 }
By the way, @ShubhamGupta29, I used Dr.Elephant to analyze Spark 2.3 event logs, and every job's analysis result looks like the following. Except for "Spark Configuration", every field is empty. Is this normal? Thanks!
Spark Configuration Severity: Moderate [Explain]
spark.application.duration | -1587978750 Seconds
spark.driver.cores | 4
spark.driver.memory | 4 GB
spark.dynamicAllocation.enabled | false
spark.executor.cores | 4
spark.executor.instances | 20
spark.executor.memory | 4 GB
spark.shuffle.service.enabled | false (Spark shuffle service is not enabled.)
spark.yarn.driver.memoryOverhead | 0 B
spark.yarn.executor.memoryOverhead | 0 B

Spark Executor Metrics Severity: None
Executor input bytes distribution | min: 0 B, p25: 0 B, median: 0 B, p75: 0 B, max: 0 B
Executor shuffle read bytes distribution | min: 0 B, p25: 0 B, median: 0 B, p75: 0 B, max: 0 B
Executor shuffle write bytes distribution | min: 0 B, p25: 0 B, median: 0 B, p75: 0 B, max: 0 B
Executor storage memory used distribution | min: 0 B, p25: 0 B, median: 0 B, p75: 0 B, max: 0 B
Executor storage memory utilization rate | 0.000
Executor task time distribution | min: 0 sec, p25: 0 sec, median: 0 sec, p75: 0 sec, max: 0 sec
Executor task time sum | 0
Total executor storage memory allocated | 1.96 GB
Total executor storage memory used | 0 B

Spark Job Metrics Severity: None
Spark completed jobs count | 0
Spark failed jobs count | 0
Spark failed jobs list |
Spark job failure rate | 0.000
Spark jobs with high task failure rates |

Spark Stage Metrics Severity: None
Spark completed stages count | 0
Spark failed stages count | 0
Spark stage failure rate | 0.000
Spark stages with high task failure rates |
Spark stages with long average executor runtimes |

Executor GC Severity: None
GC time to Executor Run time ratio | NaN
Total Executor Runtime | 0
Total GC time | 0
@xglv1985 No, this is not normal. Can you tell me which branch or source code you are using?
@ShubhamGupta29 dr-elephant_987
Can you provide the link, as linkedin/dr-elephant doesn't have any branch named dr-elephant_987?
@ShubhamGupta29 I forked my own dr-elephant from linkedin/dr-elephant master. I only put "SparkFetcher" in my conf XML file, with <use_rest_for_eventlogs>true</use_rest_for_eventlogs> and <should_process_logs_locally>true</should_process_logs_locally>. Is there any other configuration that may cause these empty fields? I will debug more deeply. Thanks.
@xglv1985 If you are using current master, you can't see any metrics from Spark 2.3+. More in: https://github.com/linkedin/dr-elephant/issues/389. Check your logs; there will be some parsing errors. That's why I am using the fork mentioned above, and why there is ongoing work from @ShubhamGupta29 to support this version.
> Metrics: "memoryMetrics" : { "usedOnHeapStorageMemory", "usedOffHeapStorageMemory", "totalOnHeapStorageMemory", "totalOffHeapStorageMemory" }

Thanks for the update, @ShubhamGupta29. They are already included in my post: https://github.com/linkedin/dr-elephant/issues/683#issuecomment-619609445
@mareksimunek Thanks very much; I see the same problem as mine in the link you gave. Then let's look forward to the updated Dr.Elephant by @ShubhamGupta29.
@mareksimunek @xglv1985, I have made the changes for Spark 2.3 (these are the foundational changes; I will fix the tests and do other cleanups in some time). If possible, can you guys try this personal branch? It has the changes for Spark 2.3.
@ShubhamGupta29 Nice, ShubhamGupta29/test23 works like a charm. It now even shows GC stats.
Executor memory used is still not showing, but I suppose if it's not available in SHS it won't be seen in Dr.Elephant. (Do you have any news on whether there is something to do to make it available in SHS?)

@mareksimunek I am working on the same; after going through Spark's code I got some idea of why this metric is not getting populated. For now I am testing the changes and will soon add them to the branch, and I am also trying to support Spark 2.4. @mareksimunek and @xglv1985, can you guys fill out the survey in #685? It would help us make Dr.Elephant more OS-community-friendly.
OK, I saw it yesterday and I will fill out the survey today.
@xglv1985 Did you get a chance to try the changes done for Spark 2.3? Feedback on the changes will make it easier to start the effort of merging them into the master branch for users' ease.
Sure, I will try your personal branch and will get feedback to you between May 1st and 5th.
@mareksimunek and @xglv1985 I have made some more changes for Spark 2.3 support; kindly try this branch whenever you guys have time. Also, for the memory heuristics a change is needed in the Spark conf: add spark.eventLog.logBlockUpdates.enabled if it is not there already, and set it to true.
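A minimal way to set it, e.g. in spark-defaults.conf (a sketch; it assumes event logging itself is already enabled, and the same flag can instead be passed per job via --conf on spark-submit):

```
# spark-defaults.conf (sketch): log block updates so storage-memory
# metrics survive into the event log (Spark 2.3+ property)
spark.eventLog.enabled                  true
spark.eventLog.logBlockUpdates.enabled  true
```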
@ShubhamGupta29 Hi, thanks for the update, and sorry for the late response.
- I needed to delete some tests to compile:
  deleted: test/com/linkedin/drelephant/tony/fetchers/TonyFetcherTest.java
  deleted: test/com/linkedin/drelephant/tuning/PSOParamGeneratorTest.java
- I had missed that it is storage memory which shows the cached RDDs. It seems to work fine after setting spark.eventLog.logBlockUpdates.enabled on the job :)
- Is there also a way to add peak memory? That is what I am most interested in.
I noticed the event log contains:

```json
{"ID":111,"Name":"internal.metrics.peakExecutionMemory","Update":96381057,"Value":96381057,"Internal":true,"Count Failed Values":true}
```
@mareksimunek Thanks for the reply and for testing out the provided version.
- TonYFetcherTest doesn't work when compiling locally, so it makes sense to remove it. For PSOParamGeneratorTest, if you want to try, you can fix it with pip install inspyred, which fixed the test for me.
- Glad the metric is getting populated for you after setting spark.eventLog.logBlockUpdates.enabled.
- I am also looking for a way to provide this metric (Peak Memory Used). Can you give me the event name from which you got this metric (internal.metrics.peakExecutionMemory)?
Also, let me know about any other issues you are facing or any suggestions you have for Dr.Elephant. I hope Dr.Elephant is proving useful for you and your team.
@ShubhamGupta29
- The event carrying internal.metrics.peakExecutionMemory is called "Event":"SparkListenerTaskEnd", but I am not sure if that's it; I am judging only by its name :). I attached an event log from the job: eventLogs-application_1587409317223_6508-1.zip
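For anyone who wants to check the same thing in their own logs, a minimal sketch follows. It is hypothetical helper code, not part of Dr.Elephant: it assumes the event log has been decompressed to newline-delimited JSON, uses Play JSON (which Dr.Elephant already depends on), and the PeakMemoryScan name is made up for illustration. Judging by the snippet above and Spark's event-log format, the accumulator rides on task-end events under "Task Info" -> "Accumulables".

```scala
import play.api.libs.json.{JsValue, Json}
import scala.io.Source

// Hypothetical scanner (not Dr.Elephant code): walk a decompressed,
// newline-delimited Spark event log and report the largest value seen for
// the peakExecutionMemory accumulator on SparkListenerTaskEnd events.
object PeakMemoryScan {
  def peakExecutionMemory(eventLogPath: String): Long =
    Source.fromFile(eventLogPath).getLines()
      .map(line => Json.parse(line))               // one JSON event per line
      .filter(event => (event \ "Event").asOpt[String].contains("SparkListenerTaskEnd"))
      .flatMap(event => (event \ "Task Info" \ "Accumulables").asOpt[Seq[JsValue]].getOrElse(Nil))
      .filter(acc => (acc \ "Name").asOpt[String].contains("internal.metrics.peakExecutionMemory"))
      .flatMap(acc => (acc \ "Value").asOpt[Long]) // accumulated value, in bytes
      .foldLeft(0L)(math.max)

  def main(args: Array[String]): Unit =
    println(s"peak execution memory: ${peakExecutionMemory(args(0))} bytes")
}
```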
So far it seems to be working like a charm. I am trying to push it through in our team (for now it is running on a small testing cluster), and with working Spark metrics it will be much easier to get approval to work on that; thanks for the progress.
Question: are you using one Dr.Elephant installation per cluster, or do you have one Dr.Elephant analyzing multiple clusters?
The current Dr.Elephant allows analyzing jobs from only a single RM (a single cluster).
@ShubhamGupta29 First, sorry for the late response. Thanks to your branch feature_spark2.3, I now have it up and running. This is my screen capture:
The good news is that it has more dimensions than past versions of Dr.Elephant. But the details of each dimension have disappeared, so I will double-check the configuration.