[SPARK-37340][UI] Display StageIds in Operators for SQL UI
What changes were proposed in this pull request?
Add an explicit stageId-to-operator mapping to the Spark UI. This is a more general version of https://issues.apache.org/jira/browse/SPARK-30209; the stageId -> operator mapping is built with the following algorithm.
- Read the SparkGraph to get every Node's name and its AccumulatorIDs.
- Get each stage's AccumulatorIDs.
- Map Operators to stages by checking for a non-empty intersection between the AccumulatorIDs from the first two steps.
- Connect SparkGraphNodes to their respective StageIDs for rendering in the SQL UI.
As a result, some operators without max metrics values will also have stageIds in the UI. In some cases, no operator->StageID mapping is made because no stage has accumulatorIds that are part of the Operator's accumulatorIds.
URL links are also provided at the top of the page to go to the succeeded jobs and completed stages that were executed as part of the selected query.
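The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `GraphNode`, `StageInfo`, `OperatorStageMapping`, and `connectOperatorsToStages` are hypothetical stand-ins, and the accumulator IDs in the example are made up.

```scala
// Hypothetical, simplified stand-ins for the structures described above;
// these are NOT the actual Spark classes touched by this PR.
final case class GraphNode(id: Long, name: String, accumulatorIds: Set[Long])
final case class StageInfo(stageId: Int, accumulatorIds: Set[Long])

object OperatorStageMapping {
  /** Maps each node id to the sorted ids of stages whose accumulators overlap the node's. */
  def connectOperatorsToStages(
      nodes: Seq[GraphNode],
      stages: Seq[StageInfo]): Map[Long, Seq[Int]] = {
    // Invert the stage -> accumulator relation once, so each accumulator lookup is cheap.
    val accumulatorToStages: Map[Long, Set[Int]] =
      stages
        .flatMap(s => s.accumulatorIds.toSeq.map(acc => acc -> s.stageId))
        .groupBy(_._1)
        .map { case (acc, pairs) => acc -> pairs.map(_._2).toSet }

    nodes.flatMap { node =>
      val stageIds =
        node.accumulatorIds.flatMap(acc => accumulatorToStages.getOrElse(acc, Set.empty[Int]))
      // A node whose accumulators appear in no stage gets no entry (the "no mapping" case).
      if (stageIds.isEmpty) None else Some(node.id -> stageIds.toSeq.sorted)
    }.toMap
  }
}

object Example {
  // Made-up data: the Exchange node has accumulators updated by both stages.
  val nodes: Seq[GraphNode] = Seq(
    GraphNode(0L, "Scan", Set(1L, 2L)),
    GraphNode(1L, "Exchange", Set(3L, 4L)),
    GraphNode(2L, "HashAggregate", Set(5L)),
    GraphNode(3L, "Project", Set.empty))
  val stages: Seq[StageInfo] = Seq(
    StageInfo(0, Set(1L, 3L)),
    StageInfo(1, Set(4L, 5L)))

  val mapping: Map[Long, Seq[Int]] =
    OperatorStageMapping.connectOperatorsToStages(nodes, stages)
}
```

In this made-up example the Exchange node maps to both stages, while the Project node, which shares no accumulators with any stage, is left out of the mapping.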
Why are the changes needed?
Makes for easier and quicker debugging and navigation.
Does this PR introduce any user-facing change?
Yes. "Succeeded Jobs:" and "Completed Stages:" are listed at the top of the UI, along with "Stages:" in some of the operators.
How was this patch tested?
Manual test locally in SQL UI.
Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49760/
Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49760/
Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49763/
Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49763/
Test build #145290 has finished for PR 34622 at commit 3489292.
- This patch passes all tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  - case class StageAttempt(
  - case class GraphNodeToStages(
Test build #145293 has finished for PR 34622 at commit 5734754.
- This patch passes all tests.
- This patch merges cleanly.
- This patch adds the following public classes (experimental):
  - case class StageAttempt(
  - case class GraphNodeToStages(
cc @tgravescs would this feature be of interest?
cc @sarutak and @gengliangwang FYI
Yes, it would be nice to have the actual stageIds in the UI. I'll need to look closer at the logic though, which likely won't be until next week.
Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50214/
Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/50214/
@tgravescs do you have time to take a quick look?
So a couple questions and concerns.
- I'm not sure having the list of Completed Stages at the top of the page helps and I'm concerned that could be a very long list. You can simply click on the job or go to the stages page to get there. You could also make the Stages link in the operator box clickable.
- What does this list for exchanges, where the exchange crosses 2 stages? Similarly for hash aggregates.
- Have you run this on more than just a couple of local jobs, i.e. on large jobs or in production?
I'll try to get this built and try it out locally
Thanks for the feedback comments so far.
- I originally added the list of completed stages for convenience on the SQL UI. Do you think it's worth removing the list of Completed Stages?
- In this case, multiple stages will show up in the operator box. It would look like
- I've run this in production at Workday.
sorry, I missed your response.
After looking some more the list of jobs can get very long as well. I'm fine with leaving the list of stages as well.
Do you know how long the connectOperatorToStage takes for a query with larger graph?
Unfortunately I don't know how long connectOperatorToStage takes for queries with a larger graph. I haven't seen any runtime issues here, at least.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
@tgravescs @martin-g should I create another pull request for this feature to try to get it merged? I'm unable to reopen the PR.
Sure. It's been a while, but I think I had tried this out and was seeing some performance issues with it. I'd have to look at it again to remember. Did you run any performance tests?
No, I don't know what sort of performance tests should be run for this feature.
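One rough way to get a first number would be to time a mapping function on synthetic graphs of increasing size. The sketch below is only an illustration under stated assumptions: `Node`, `Stage`, `naive`, and `synthetic` are hypothetical, the pairwise `naive` mapping is not necessarily how connectOperatorToStage is implemented, and the sizes are arbitrary. It uses single-run wall-clock timing without JIT warm-up, so results are indicative at best.

```scala
object MappingTimingSketch {
  // Hypothetical stand-ins, NOT the classes from this PR.
  final case class Node(id: Long, accumulatorIds: Set[Long])
  final case class Stage(stageId: Int, accumulatorIds: Set[Long])

  // Assumed naive mapping: test every node against every stage for a shared accumulator,
  // i.e. O(nodes * stages) intersection checks.
  def naive(nodes: Seq[Node], stages: Seq[Stage]): Map[Long, Seq[Int]] =
    nodes.map { n =>
      n.id -> stages.filter(s => s.accumulatorIds.exists(n.accumulatorIds)).map(_.stageId)
    }.filter(_._2.nonEmpty).toMap

  // Synthetic input: node i owns accumulators 2i and 2i + 1;
  // each stage covers a contiguous block of `nodesPerStage` nodes.
  def synthetic(numNodes: Int, nodesPerStage: Int): (Seq[Node], Seq[Stage]) = {
    val nodes = (0 until numNodes).map(i => Node(i.toLong, Set(2L * i, 2L * i + 1)))
    val stages = nodes.grouped(nodesPerStage).zipWithIndex.map {
      case (group, idx) => Stage(idx, group.flatMap(_.accumulatorIds).toSet)
    }.toVector
    (nodes, stages)
  }

  // Wall-clock time of a single evaluation, in milliseconds.
  def timeMillis[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e6)
  }

  def main(args: Array[String]): Unit = {
    for (n <- Seq(100, 1000, 5000)) {
      val (nodes, stages) = synthetic(n, nodesPerStage = 10)
      val (mapping, ms) = timeMillis(naive(nodes, stages))
      println(f"nodes=$n%d stages=${stages.size}%d mapped=${mapping.size}%d time=$ms%.2f ms")
    }
  }
}
```

Watching how the time grows as the node and stage counts increase would show whether the mapping step becomes a concern for large queries; a proper benchmark harness with warm-up runs would give more trustworthy numbers.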
yeah I believe I was implementing similar functionality in a tool we have and that algorithm had performance issues when things got large. I'll have to go dig it up again though.
I assume this change isn't something you are running with personally or in some production workloads?
This change is run in production workloads but we haven't noticed performance issues. At what scale did the performance issues come up and how were they detected in your case?