[SPARK-37340][UI] Display StageIds in Operators for SQL UI

Open yliou opened this issue 3 years ago • 16 comments

What changes were proposed in this pull request?

Add an explicit stageId-to-operator mapping to the Spark SQL UI. This is a more general version of https://issues.apache.org/jira/browse/SPARK-30209. The stageId -> operator mapping is built with the following algorithm:

  1. Read the SparkGraph to get every node's name and its accumulator IDs.
  2. Get each stage's accumulator IDs.
  3. Map operators to stages by checking for a non-empty intersection between the accumulator IDs from steps 1 and 2.
  4. Connect SparkGraphNodes to their stage IDs for rendering in the SQL UI.

As a result, some operators without max metrics values will also have stageIds in the UI. In some cases, no operator -> stageId mapping is made because no stage has accumulator IDs that are part of the operator's accumulator IDs. URL links at the top of the page, leading to the succeeded jobs and completed stages that were executed as part of the selected query, are also provided.
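
The intersection step can be illustrated with a small standalone sketch. It is written in Python for brevity and is not the PR's Scala code: the function name `map_operators_to_stages`, the dictionary shapes, and the sample operator names and IDs are all hypothetical. It also assumes an inverted index from accumulator ID to stage, which is one possible way to compute the intersections and not necessarily how `connectOperatorToStage` does it.

```python
from collections import defaultdict
from typing import Dict, Set


def map_operators_to_stages(
    node_accumulators: Dict[str, Set[int]],
    stage_accumulators: Dict[int, Set[int]],
) -> Dict[str, Set[int]]:
    """For each operator node, return the stage IDs that share an accumulator ID with it."""
    # Invert the stage data: accumulator ID -> stages that reported it.
    stages_by_accumulator: Dict[int, Set[int]] = defaultdict(set)
    for stage_id, accumulator_ids in stage_accumulators.items():
        for accumulator_id in accumulator_ids:
            stages_by_accumulator[accumulator_id].add(stage_id)

    # A node maps to every stage that shares at least one accumulator ID with it.
    node_to_stages: Dict[str, Set[int]] = {}
    for node_name, accumulator_ids in node_accumulators.items():
        stages: Set[int] = set()
        for accumulator_id in accumulator_ids:
            stages |= stages_by_accumulator.get(accumulator_id, set())
        node_to_stages[node_name] = stages
    return node_to_stages


# Hypothetical sample data: operator name -> accumulator IDs (step 1).
node_accumulators = {
    "Scan parquet": {1, 2},
    "Exchange": {3, 4},
    "HashAggregate": {5},
    "WholeStageCodegen (1)": {9},
}
# Hypothetical sample data: stage ID -> accumulator IDs (step 2).
stage_accumulators = {0: {1, 2, 3}, 1: {4, 5}}

mapping = map_operators_to_stages(node_accumulators, stage_accumulators)
for name, stages in mapping.items():
    print(name, sorted(stages))
```

In this sample, the exchange shares accumulator IDs with both stages and so maps to two stage IDs, while the last node shares none and gets no mapping, matching the two cases described above.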

Why are the changes needed?

Makes for easier and quicker debugging and navigation.

Does this PR introduce any user-facing change?

Yes. Succeeded Jobs: and Completed Stages: are listed at the top of the UI, and Stages: appears in some of the operators. [Screenshot: Screen Shot 2021-11-16 at 11 35 51 AM]

How was this patch tested?

Manual test locally in SQL UI.

yliou avatar Nov 16 '21 19:11 yliou

Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49760/

SparkQA avatar Nov 16 '21 20:11 SparkQA

Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49760/

SparkQA avatar Nov 16 '21 21:11 SparkQA

Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49763/

SparkQA avatar Nov 16 '21 23:11 SparkQA

Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49763/

SparkQA avatar Nov 17 '21 00:11 SparkQA

Test build #145290 has finished for PR 34622 at commit 3489292.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StageAttempt(
  • case class GraphNodeToStages(

SparkQA avatar Nov 17 '21 00:11 SparkQA

Test build #145293 has finished for PR 34622 at commit 5734754.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class StageAttempt(
  • case class GraphNodeToStages(

SparkQA avatar Nov 17 '21 03:11 SparkQA

cc @tgravescs would this feature be of interest?

yliou avatar Nov 22 '21 18:11 yliou

cc @sarutak and @gengliangwang FYI

HyukjinKwon avatar Nov 23 '21 00:11 HyukjinKwon

Yes, it would be nice to have the actual stageIds in the UI. I'll need to look closer at the logic though, which likely won't be until next week.

tgravescs avatar Nov 23 '21 15:11 tgravescs

Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50214/

SparkQA avatar Nov 30 '21 04:11 SparkQA

Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50214/

SparkQA avatar Nov 30 '21 05:11 SparkQA

@tgravescs do you have time to take a quick look?

yliou avatar Mar 03 '22 22:03 yliou

So a couple questions and concerns.

  1. I'm not sure having the list of Completed Stages at the top of the page helps, and I'm concerned that it could be a very long list. You can simply click on the job or go to the stages page to get there. You could also make the Stages link in the operator box clickable.
  2. What does this list for exchanges where the exchange crosses 2 stages? Similarly for hash aggregates.
  3. Have you run this on more than just a couple of local jobs, i.e. on large jobs or in production?

I'll try to get this built and try it out locally

tgravescs avatar Mar 09 '22 19:03 tgravescs

Thanks for the feedback so far.

  1. I originally added the list of completed stages for convenience on the SQL UI. Do you think it's worth removing the list of Completed Stages?
  2. In this case, multiple stages will show up in the operator box. It would look like this: [image]
  3. I've run this in production at Workday.

yliou avatar Mar 11 '22 01:03 yliou

sorry, I missed your response.

After looking some more, the list of jobs can get very long as well. I'm fine with leaving the list of stages as well.

Do you know how long connectOperatorToStage takes for a query with a larger graph?

tgravescs avatar Apr 05 '22 13:04 tgravescs

Unfortunately, I don't know how long connectOperatorToStage takes for queries with a larger graph. I haven't seen any runtime issues here, at least.

yliou avatar Apr 07 '22 01:04 yliou

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Dec 02 '22 00:12 github-actions[bot]

@tgravescs @martin-g should I create another pull request for this feature to try to get it merged? I'm unable to reopen the PR.

yliou avatar Jun 08 '23 19:06 yliou

Sure, it's been a while, but I think I had tried this out and was seeing some performance issues with it. I'd have to look at it again to remember. Did you run any performance tests?

tgravescs avatar Jun 08 '23 20:06 tgravescs

No, I don't know what sort of performance tests should be run for this feature.

yliou avatar Jun 09 '23 21:06 yliou

Yeah, I believe I was implementing similar functionality in a tool we have, and that algorithm had performance issues when things got large. I'll have to go dig it up again though.

I assume this change isn't something you are running with personally or in some production workloads?

tgravescs avatar Jun 12 '23 18:06 tgravescs

This change is run in production workloads but we haven't noticed performance issues. At what scale did the performance issues come up and how were they detected in your case?

yliou avatar Jun 12 '23 21:06 yliou