
Unable to get correct metrics for Spark

Open Parth59 opened this issue 6 years ago • 27 comments

Hi, I know that getting Spark to work correctly with Dr-Elephant is already discussed in #327. Following those details, I tried using the Spark REST client, through which I am able to extract Spark job data. But in the UI all Spark metrics are displayed as 0, as shown in the attached screenshot. Can anyone please post the steps for configuring Spark 2.x to work correctly with Dr-Elephant?

(screenshot: all Spark metrics in the Dr-Elephant UI displayed as 0)
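For reference, a Spark fetcher entry in app-conf/FetcherConf.xml looks roughly like the sketch below; the class name is from the Dr-Elephant repo, but exact parameters vary by version, so treat it as illustrative:

    <fetcher>
      <applicationtype>spark</applicationtype>
      <classname>com.linkedin.drelephant.spark.fetchers.SparkFetcher</classname>
    </fetcher>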

Parth59 avatar May 24 '18 12:05 Parth59

Hi, I have the same question. Only the Spark configuration metric is being flagged (amber, asking for further tuning on some of the jobs); the rest are mostly green, with null/0 values in the metric stats. Is this a config issue, or a Spark 2.x / Dr-Elephant compatibility issue?

Regards

simul-tion avatar May 31 '18 20:05 simul-tion

You should modify the config to set the Spark version to 2.x, and then update your Spark core API calls for 2.x. There is currently no Spark 2.x API interface, so you have to modify it yourself. Hope this helps.

Seandity avatar Jun 02 '18 13:06 Seandity

@Seandity Thanks for your response. Could you be more specific? I'm not sure I understand the proposed approach. Can you share an example of what you are suggesting?

Regards

simul-tion avatar Jun 04 '18 14:06 simul-tion

Hi,

@Parth59 Did you find a way through this?

@Seandity @shkhrgpt @akshayrai Referring to #327, can someone confirm whether these metrics depend on the open item "SPARK-23206"? If not, I would really appreciate it if you could share how to sort this out.

Regards

simul-tion avatar Jun 05 '18 14:06 simul-tion

Hi,

I would appreciate it if someone could advise on this.

Regards

simul-tion avatar Jun 06 '18 19:06 simul-tion

Sorry for the late response. This issue happens because Dr-Elephant does not support Spark 2.x apps. What makes it confusing is that you do see Spark 2.x apps in the Dr-Elephant UI, but the data is incomplete. The data is incomplete because the fetcher only partially processes the event logs and, instead of failing, analyzes the partial data. If you inspect the Dr-Elephant logs, you will see the parsing exceptions. I hope this helps.

shkhrgpt avatar Jun 06 '18 20:06 shkhrgpt

Hi,

@shkhrgpt do we have a workaround for this issue? How do we enable correct parsing of Spark 2.x logs?

prachi2396 avatar Jun 07 '18 08:06 prachi2396

@shkhrgpt thanks for the update.

Observations and questions:

  1. I just verified by spawning a new Dr-Elephant instance against Spark 1.6, and one of the Spark jobs did show suggestions from Dr-Elephant for the other metrics as well.

  2. If Dr-Elephant doesn't support 2.x, then what does #327 point to? Can you clarify?

  3. Assuming 2.x is not supported as of now, do you have visibility into what effort is needed for Dr-Elephant to support it, or is something already in the pipeline?

Thanks in advance.

Regards

simul-tion avatar Jun 07 '18 15:06 simul-tion

Hello @akshayrai @shkhrgpt

I would appreciate it if you could clarify this.

Regards

simul-tion avatar Jun 10 '18 01:06 simul-tion

I don't think there is an easy workaround to support Spark 2.x.

PR #327 recommends using a custom Spark History Server (SHS), which provides stable REST APIs to support Spark 2.x in Dr-Elephant. However, as far as I know, not all of the changes required for this custom SHS have been checked into the open-source Spark project. Maybe @akshayrai can provide more detail about this.

I think that to support Spark 2.x we need to extend the parser that parses event logs. Most of the parsing logic is implemented in the SparkDataCollection class, which uses various Spark listeners to replay the event logs. The issue is that SparkDataCollection assumes Spark 1.6 when it uses listeners and other related Spark classes. To support Spark 2.x, we can either make SparkDataCollection compatible with Spark 2.x as well, or make it independent of the Spark version.
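For context, replaying an event log through listeners works roughly like the minimal sketch below. This is an illustration, not Dr-Elephant's actual code; note that ReplayListenerBus is private[spark], which is why Dr-Elephant's fetcher classes are declared under the org.apache.spark package:

    package org.apache.spark.deploy.history

    import java.io.InputStream

    import org.apache.spark.scheduler.{ReplayListenerBus, SparkListener, SparkListenerApplicationEnd}

    object ReplaySketch {
      // Replays a decompressed event-log stream through registered listeners.
      def replayLog(in: InputStream, sourceName: String): Unit = {
        val bus = new ReplayListenerBus()
        // Any SparkListener can be attached to accumulate the metrics you care about.
        bus.addListener(new SparkListener {
          override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
            println(s"Application ended at ${end.time}")
        })
        // Feeds each JSON event line of the log through the registered listeners.
        bus.replay(in, sourceName)
      }
    }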

shkhrgpt avatar Jun 10 '18 19:06 shkhrgpt

As mentioned by @shkhrgpt, this is caused by the inability to parse SHS event logs produced by Spark 2.x. The SparkDataCollection class processes SHS event logs based on Spark 1.4, so Spark 2.x logs cause errors due to the newly added event types.

As a temporary measure, I changed the sbt Spark version to 2.2.1 or lower, and when calling ReplayListenerBus.replay I used a ReplayEventsFilter to skip the event types newly added in 2.x, so that the rest of the log can still be replayed for Spark 2.x apps.
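Roughly, the filtering looks like the lines below, continuing the replay sketch above. The event names here are hypothetical stand-ins for whatever 2.x-only events break the 1.x-era parser, and, like ReplayListenerBus, the ReplayEventsFilter alias is private[spark], so this code must also live under the org.apache.spark package:

    import org.apache.spark.scheduler.ReplayListenerBus.ReplayEventsFilter

    // Hypothetical Spark 2.x-only event names to drop during replay.
    val skippedEvents = Set("SparkListenerExecutorBlacklisted", "SparkListenerNodeBlacklisted")

    // The filter is invoked with the raw JSON line of each event.
    val keepEvent: ReplayEventsFilter = (line: String) => !skippedEvents.exists(line.contains)

    // eventsFilter drops the unparseable events before they reach the listeners.
    bus.replay(in, sourceName, maybeTruncated = false, eventsFilter = keepEvent)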

If you do not mind, please refer to my forked source: https://github.com/songgane/dr-elephant/tree/feature/support_spark_2.x

songgane-zz avatar Jun 23 '18 13:06 songgane-zz

Thanks, @songgane, for sharing your fix for Spark 2.x support. Would it be possible for you to submit a PR for this change?

shkhrgpt avatar Jun 24 '18 04:06 shkhrgpt

@songgane I tried the shared source, but it fails to compile. Is there anything I should modify before compiling? (I tried with the default configuration: Spark 1.4 and Hadoop 2.3.)

simul-tion avatar Jul 02 '18 13:07 simul-tion

@songgane @shkhrgpt @akshayrai I am running Dr-Elephant on my Spark cluster. However, I only see the following metrics for my jobs: Spark Configuration, Spark Executor Metrics, Spark Job Metrics, Spark Stage Metrics, and Executor GC.

Do we have some additional metrics on CPU/memory utilization here?

ritika11 avatar Jul 04 '18 09:07 ritika11

@ritika11 Are you able to view the metrics with Spark 2+ or Spark 1.6? I am still getting metrics with a value of 0 with Spark 2.

ethanhunt07 avatar Jul 04 '18 13:07 ethanhunt07

@songgane I tried your code, but it fails to compile. Do we need to compile it with Spark 2+ or 1.x?

ethanhunt07 avatar Jul 04 '18 13:07 ethanhunt07

@ethanhunt07 Yes, I am able to view the Spark heuristics with Spark 2+. However, I am looking for options to add more metrics and heuristics in the code.

ritika11 avatar Jul 04 '18 15:07 ritika11

@ethanhunt07 To get the Spark aggregation metric values, you need to set the spark.executor.instances and spark.executor.memory options. Did you set those options? If spark.executor.instances and spark.executor.memory are not present, the resulting value is zero; the Spark default property values are not used.
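For example, the options can be set in spark-defaults.conf (the values here are just illustrative):

    spark.executor.instances  4
    spark.executor.memory     4g

or passed per job via spark-submit:

    spark-submit --conf spark.executor.instances=4 --conf spark.executor.memory=4g ...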

songgane-zz avatar Jul 04 '18 15:07 songgane-zz

@Pravdeep Did you compile the feature/support_spark_2.x branch? I used Spark 2.1.2 and Hadoop 2.3.0. Because my code uses Spark 2.x features, the Spark version needs to be set to 2.x+.

songgane-zz avatar Jul 04 '18 15:07 songgane-zz

@songgane Compilation failed with Spark 1.4 and Hadoop 2.3, so I tried with Spark 2.1.2 and Hadoop 2.3.0; compilation still fails, unable to resolve dependencies. Is there anything else that needs to be set to compile your source code? (Attaching the logs.) Log.txt

simul-tion avatar Jul 18 '18 15:07 simul-tion

@Pravdeep From your log message, it seems there is a problem with the certificate. You must add a valid certificate. If you google the error message, you will find a lot of information.

[error] Server access Error: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target url=https://repo1.maven.org/maven2/org/apache/geronimo/specs/geronimo-jms_1.1_spec/1.1.1/geronimo-jms_1.1_spec-1.1.1.pom
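One common fix (a sketch, with illustrative alias and file names) is to download the repository's certificate and import it into the JVM truststore that sbt uses:

    # Extract the certificate presented by Maven Central.
    openssl s_client -connect repo1.maven.org:443 </dev/null 2>/dev/null \
      | openssl x509 -outform PEM > repo1.pem

    # Import it into the JDK truststore (default password: changeit).
    keytool -importcert -alias repo1-maven -file repo1.pem \
      -keystore "$JAVA_HOME/jre/lib/security/cacerts" -storepass changeit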

songgane-zz avatar Jul 19 '18 11:07 songgane-zz

@songgane Yep, I noticed, but do you know why your build requires this dependency? I didn't have to include any certs when I built the generally available dr-elephant branch (which I'm currently running). Do I need to make any config changes while building your source, or is it possible for you to share an already-built version of your branch that supports Spark 2.x metrics?

simul-tion avatar Jul 19 '18 16:07 simul-tion

@Pravdeep The difference between my branch and the dr-elephant master branch is nothing more than library versions. Certificate problems are mainly caused by your build environment: perhaps the libraries are managed privately through a repository manager such as Nexus, or the HTTPS service is blocked due to security restrictions.

songgane-zz avatar Jul 23 '18 07:07 songgane-zz

I get almost the same result as Parth's. I compiled Dr-Elephant with Hadoop 2.7.6 and Spark 1.6.2 and ran it on Hadoop 2.7.6 and Spark 2.3.0. It works for HadoopJava jobs, but not for Spark jobs. I checked dr_elephant.log, which shows:

08-10-2018 14:46:41 INFO [dr-el-executor-thread-1] org.apache.spark.deploy.history.SparkFSFetcher$ : Replaying Spark logs for application: application_1533540053870_0023 withlogPath: webhdfs://algo:50070/tmp/spark/events/application_1533540053870_0023.lz4 with codec:Some(org.apache.spark.io.LZ4CompressionCodec@4f5f47dd)

It did not report any errors while replaying the Spark event log, but the heuristics are all 0's. Here is part of the event log's content. Is there something wrong?

{"Event":"SparkListenerExecutorAdded","Timestamp":1533723506979,"Executor ID":"1","Executor Info":{"Host":"algo","Total Cores":1,"Log Urls":{"stdout":"http://algo:8042/node/containerlogs/container_1533540053870_0015_01_000002/algo/stdout?start=-4096","stderr":"http://algo:8042/node/containerlogs/container_1533540053870_0015_01_000002/algo/stderr?start=-4096"}}} {"Event":"SparkListenerTaskStart","Stage ID":0,"Stage Attempt ID":0,"Task Info":{"Task ID":0,"Index":0,"Attempt":0,"Launch Time":1533723506982,"Executor ID":"1","Host":"algo","Locality":"NODE_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Killed":false,"Accumulables":[]}} {"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"1","Host":"algo","Port":34015},"Maximum Memory":4392694579,"Timestamp":1533723507032,"Maximum Onheap Memory":4392694579,"Maximum Offheap Memory":0} {"Event":"SparkListenerExecutorAdded","Timestamp":1533723508088,"Executor ID":"2","Executor Info":{"Host":"algo","Total Cores":1,"Log Urls":{"stdout":"http://algo:8042/node/containerlogs/container_1533540053870_0015_01_000003/algo/stdout?start=-4096","stderr":"http://algo:8042/node/containerlogs/container_1533540053870_0015_01_000003/algo/stderr?start=-4096"}}} {"Event":"SparkListenerTaskStart","Stage ID":0,"Stage Attempt ID":0,"Task Info":{"Task ID":1,"Index":1,"Attempt":0,"Launch Time":1533723508089,"Executor ID":"2","Host":"algo","Locality":"NODE_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Killed":false,"Accumulables":[]}} {"Event":"SparkListenerBlockManagerAdded","Block Manager ID":{"Executor ID":"2","Host":"algo","Port":41386},"Maximum Memory":4392694579,"Timestamp":1533723508146,"Maximum Onheap Memory":4392694579,"Maximum Offheap Memory":0}

Windyhe avatar Aug 10 '18 06:08 Windyhe

@Windyhe If you don't set the spark.executor.instances or spark.executor.memory values, aggregation doesn't work.

songgane-zz avatar Aug 31 '18 01:08 songgane-zz

@songgane Hi, how do I set the spark.executor.instances and spark.executor.memory values? Thanks!

I added them to spark-defaults.conf, but in the UI all Spark metrics are still displayed as 0...


YunKillerE avatar Oct 29 '18 08:10 YunKillerE

@ethanhunt07 Yes, I am able to view the Spark heuristics with Spark 2+. However, I am looking for options to add more metrics and heuristics in the code.

Can you please share your Fetcher.xml file content? I am also trying to analyze Spark 2.3 jobs, but I am facing this issue: [error] o.a.s.s.ReplayListenerBus - Exception parsing Spark event log: application_1510469066221_0020 org.json4s.package$MappingException: Did not find value which can be converted into boolean. Help is much appreciated.

ankurchourasiya avatar Nov 17 '18 16:11 ankurchourasiya