How to understand the spark-lens statistics. Is there any wiki?

normalscene opened this issue 5 years ago • 16 comments

Hi @mayurdb / @iamrohit / @rishitesh

We have generated our application's performance report, but we are having trouble understanding the output (HTML) report. We generated the JSON file and uploaded it to your reporting service, which produced a nice HTML report (http://sparklens.qubole.com/report_view/616b961beb11dc0298b1).

Could you please direct us to any documentation that explains how to interpret the output HTML report? Thank you in advance.

Gaurav

normalscene avatar Feb 05 '20 13:02 normalscene

Update: I was able to find some explanation of the reports at https://www.qubole.com/blog/introducing-sparklens-report/. That part is fine.

If you take a look at the generated report, some of the values are negative (e.g. under "Efficiency Statistics" > "Driver vs Executor Wallclock Time").

We are not able to understand why these values are negative, and the blog post does not mention this.

normalscene avatar Feb 05 '20 13:02 normalscene

Hello - I am following up on my question. Thanks.

normalscene avatar Feb 07 '20 08:02 normalscene

Hi, I am seeking help on the generated report. Please advise.

normalscene avatar Feb 10 '20 05:02 normalscene

Hi, @normalscene

The negative values of the driver/executor wall-clock time might be an indication of event loss while running the Spark application. In newer releases of Sparklens, we handle event loss gracefully. Were you using an older version while generating the sparklens.json? If so, I would suggest you try with the latest version.

If you are already using the latest version, can you please try to generate the report once via the command line instead of using sparklens.qubole.com? This will help us narrow the issue down to either the Sparklens reporting service or Sparklens core. You can find the steps to generate the Sparklens command-line report in the README.

mayurdb avatar Feb 10 '20 06:02 mayurdb

Hello @mayurdb

The current version being used is as follows:

gaurav.arya@UB0013:~/mycode/myScalaStuff/sparklens$ cat version.sbt 
version in ThisBuild := "0.3.1"
gaurav.arya@UB0013:~/mycode/myScalaStuff/sparklens$ 

I believe this is the latest version, as I pulled it recently (less than two weeks ago). Please correct me if I am wrong here.

Unfortunately I am having difficulty spotting the command-line reporting steps. Could you kindly point me to the section number?

Just for the record, here is the command I am using to submit my job:

gcloud dataproc jobs submit spark \
	--project our-project \
	--cluster deltest \
	--region us-central1  \
	--properties \
			spark.submit.deployMode=cluster,\
			spark.dynamicAllocation.enabled=false,\
			spark.yarn.maxAppAttempts=1,\
			spark.driver.memory=4G,\
			spark.driver.memoryOverhead=1024m,\
			spark.executor.instances=3,\
			spark.executor.memoryOverhead=1024m,\
			spark.executor.memory=4G,\
			spark.executor.cores=2,\
			spark.driver.cores=1,\
			spark.driver.maxResultSize=2g,\
			spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
	--class com.figmd.someProduct.deletion.dataCleanerMain \
	--jars=gs://our_repository/jars/newDataCleaner.jar,\
			gs://spark-lib/bigquery/spark-bigquery-latest.jar,\
			gs://our_repository/jars/jdbc-postgresql.jar,\
			gs://our_repository/jars/postgresql-42.2.5.jar,\
			gs://our_repository/jars/sparklens_2.11-0.3.1.jar \
	-- cdr 289 AAAA,BBBB our-project CDRDELTEST 20200121 0001

normalscene avatar Feb 10 '20 06:02 normalscene

With the above command itself, you should be able to see the report in the driver logs after application completion.

mayurdb avatar Feb 10 '20 06:02 mayurdb

Oh! Understood. I will have to re-run the job (it takes around 75 minutes) and I will get back after that.

Thanks for your prompt reply.

Regards, Gaurav

normalscene avatar Feb 10 '20 06:02 normalscene

To get more accurate results, before you re-run, check whether there was event loss in the previous runs: grep for the keyword "Dropped" in the driver logs. If there was event loss, try increasing the event queue capacity to a higher value and also increase the driver's memory. The relevant conf is spark.scheduler.listenerbus.eventqueue.capacity.

mayurdb avatar Feb 10 '20 06:02 mayurdb
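
For illustration, the check and the extra setting could look like the snippet below; the log path and the value 20000 are placeholders rather than anything from this thread (Spark's default capacity for this queue is lower).

# Search a saved copy of the Dataproc driver output for dropped-event warnings
# (driver-output.log is a placeholder path).
grep -i "dropped" driver-output.log

# If anything shows up, add the queue-capacity property to the existing
# --properties list, optionally together with a larger driver memory:
#     spark.scheduler.listenerbus.eventqueue.capacity=20000,\
#     spark.driver.memory=8G,\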

@mayurdb

I have re-run the job (this time with a smaller data sample to make the processing a little quicker than the earlier run, which took around 75 minutes).

JSON reporting & command-line statistics - both were enabled.

Here are my observations:

  1. Looked for "Dropped" (case-insensitive search) in the logs but did not find it, so I guess nothing was dropped.

  2. With or without spark.scheduler.listenerbus.eventqueue.capacity set, the output reports still show negative values.

Please advise which logs I can share with you to help investigate this.

Edit: Does this seem like a bug? Please confirm.

Links to generated reports:

http://sparklens.qubole.com/report_view/0cfede16246e4423c402 (without drop queue setting)
http://sparklens.qubole.com/report_view/447874de4f73db369339 (with drop queue setting)

normalscene avatar Feb 10 '20 12:02 normalscene

@normalscene The negative numbers can show up for various reasons. One of the things we try to do is to figure out how much time the executors were busy versus how much time the driver was busy. We don't really have any metrics in the event log about what the driver is doing, so the driver time is estimated by subtracting the time spent in executors from the total execution time. The negative numbers show up when our estimate of time spent in executors ends up exceeding the total application time.

The main difficulty is figuring out which stages ran in parallel: when we get that wrong, we end up adding the time spent in those stages instead of taking the max among them. As @mayurdb mentioned, another reason is loss of events, where the time doesn't add up because some events are missing. Please share the event log file and I can try to figure this out. Thanks.

iamrohit avatar Feb 14 '20 01:02 iamrohit
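
To make the over-counting described above concrete, here is a small illustrative Scala sketch. It is not Sparklens's actual code, and the stage timings and the 100 ms application length are made up; it only shows how summing the durations of two stages that really ran in parallel makes the leftover "driver time" go negative, whereas merging the overlapping intervals (effectively taking the max) keeps it positive.

// Illustrative only: why "total time minus executor time" can go negative
// when parallel stages are summed instead of merged.
case class StageSpan(startMs: Long, endMs: Long) {
  def duration: Long = endMs - startMs
}

object DriverTimeEstimate {
  // Executor wallclock with overlap handled: merge the intervals so that
  // overlapping stages are counted once instead of being added together.
  def mergedWallclock(stages: Seq[StageSpan]): Long = {
    val sorted = stages.sortBy(_.startMs)
    var total = 0L
    var curStart, curEnd = -1L
    for (s <- sorted) {
      if (curEnd < 0 || s.startMs > curEnd) {      // new disjoint interval
        if (curEnd >= 0) total += curEnd - curStart
        curStart = s.startMs
        curEnd = s.endMs
      } else {                                     // overlapping: extend the span
        curEnd = math.max(curEnd, s.endMs)
      }
    }
    if (curEnd >= 0) total += curEnd - curStart
    total
  }

  def main(args: Array[String]): Unit = {
    val appWallclockMs = 100L
    // Two stages that ran largely in parallel.
    val stages = Seq(StageSpan(0L, 60L), StageSpan(10L, 70L))

    val naiveExecutorMs  = stages.map(_.duration).sum   // 120 ms: overlap double-counted
    val mergedExecutorMs = mergedWallclock(stages)      // 70 ms: overlap counted once

    // Driver time is estimated as whatever is left of the total runtime.
    println(s"driver time (summed stages) = ${appWallclockMs - naiveExecutorMs} ms")  // -20 ms
    println(s"driver time (merged stages) = ${appWallclockMs - mergedExecutorMs} ms") //  30 ms
  }
}

With the overlap merged, the same subtraction stays non-negative, which is the behaviour the explanation above describes as the intended estimate.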

@iamrohit

Could you please point me to the "event log file"? I am not aware of its whereabouts. Kindly assist and I shall make it available ASAP. Thanks in advance.

Cheers, Gaurav

normalscene avatar Feb 14 '20 10:02 normalscene

@normalscene

You will need to add the following config options to your spark-submit: spark.eventLog.enabled (true) and spark.eventLog.dir (s3:///some/where/).

After the application is complete, the event log file will be available in the spark.eventLog.dir location.

iamrohit avatar Feb 17 '20 06:02 iamrohit
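
As a quick sanity check for a Dataproc/GCS setup like the one in this thread (the bucket path below is a placeholder), the two options go into the existing --properties list, and after the run the event log should appear in the configured directory, typically named after the YARN application ID:

# Added to the --properties list of the gcloud dataproc command:
#     spark.eventLog.enabled=true,\
#     spark.eventLog.dir=gs://some-bucket/spark-event-logs/,\

# After the job completes, confirm that a log file was written:
gsutil ls gs://some-bucket/spark-event-logs/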

@iamrohit

Thank you. I will try this out and get back with results shortly.

normalscene avatar Feb 17 '20 06:02 normalscene

@iamrohit

I gave it a shot and unfortunately it didn't generate any logs :(.

Allow me to share the command I used, just to double-check whether I ran the right command. Have I missed something?

gcloud dataproc jobs submit spark \
	--project bda-sandbox \
	--cluster cdrperf \
	--region us-central1 \
	--properties \
		spark.eventLog.enabled=true,\
		spark.eventLog.dir=gs://cdrmigration/spark-lens-reports/cdrperftest/eventlogs,\
		spark.sparklens.reporting.disabled=false,\
		spark.sparklens.data.dir=gs://cdrmigration/spark-lens-reports/cdrperftest,\
		spark.extraListeners=com.qubole.sparklens.QuboleJobListener,\
		spark.submit.deployMode=cluster,\
		spark.dynamicAllocation.enabled=false,\
		spark.yarn.maxAppAttempts=1,\
		spark.driver.memory=36G,\
		spark.driver.memoryOverhead=4096m,\
		spark.executor.instances=3,\
		spark.executor.memoryOverhead=4096m,\
		spark.executor.memory=36G,\
		spark.executor.cores=8,\
		spark.driver.cores=4,\
		spark.driver.maxResultSize=10g \
	--jars=gs://cdrmigration/jars/cdrprocprod.jar,gs://cdrmigration/jars/jdbc-postgresql.jar,gs://cdrmigration/jars/postgresql-42.2.5.jar,gs://cdrmigration/jars/sparklens_2.11-0.3.1.jar \
	--class com.figmd.janus.filetocdr.CDRProcessing \
	-- "registry_name" "cdrmigration" jjjjjjjj-cf66-428b-bf02-caa1f608bf36 51dce49e-854a-4f54-8127-34472b12a5ac kkkkkkkk-854a-4f54-8127-34472b12a5ac

normalscene avatar Feb 18 '20 07:02 normalscene

@iamrohit

I have captured the Sparklens command-line output. It doesn't give any indication of whether the event log capture was invoked or not.

normalscene avatar Feb 18 '20 15:02 normalscene

@normalscene I'm not really aware of the internal workings of Dataproc. If you have access to the Spark history server, newer versions allow downloading this file directly from the history server UI.

iamrohit avatar Feb 20 '20 06:02 iamrohit