kyuubi
kyuubi copied to clipboard
[SPARK] Add benchmark for Spark TRowSet generation of row-based and column-based
:mag: Description
Issue References 🔗
Subtask of #5808.
Describe Your Solution 🔧
Add performance benchmark for Spark TRowSet generation for
- row-based TRowSet on HIVE_CLI_SERVICE_PROTOCOL_V5 and below
- column-based TRowSet on HIVE_CLI_SERVICE_PROTOCOL_V6 and above
Types of changes :bookmark:
- [ ] Bugfix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
Test Plan 🧪
Behavior Without This Pull Request :coffin:
Behavior With This Pull Request :tada:
Row-based:
Column-based:
Related Unit Tests
Added "to row set benchmark" ut in Spark Engine's RowSetSuite.
Checklists
📝 Author Self Checklist
- [x] My code follows the style guidelines of this project
- [x] I have performed a self-review
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [x] This patch was not authored or co-authored using Generative Tooling
📝 Committer Pre-Merge Checklist
- [ ] Pull request title is okay.
- [ ] No license issues.
- [ ] Milestone correctly set?
- [ ] Test coverage is ok
- [ ] Assignees are selected.
- [ ] Minimum number of approvals
- [ ] No changes are requested
Be nice. Be informative.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (
79b24a7
) 61.44% compared to head (cba080a
) 61.29%. Report is 5 commits behind head on master.
:exclamation: Current head cba080a differs from pull request most recent head 51919fb. Consider uploading reports for the commit 51919fb to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## master #5809 +/- ##
============================================
- Coverage 61.44% 61.29% -0.15%
Complexity 23 23
============================================
Files 608 608
Lines 36094 36027 -67
Branches 4952 4952
============================================
- Hits 22178 22083 -95
- Misses 11522 11560 +38
+ Partials 2394 2384 -10
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Should we create a new class separately? And you can refer to org.apache.spark.sql.ZorderCoreBenchmark
.
Should we create a new class separately? And you can refer to
org.apache.spark.sql.ZorderCoreBenchmark
.
Thanks for the advice. Moved to a new class RowSetBenchmark
.
And based on your screenshots in your PR desc, what are actually the control group, experimental groups?
And based on your screenshots in your PR desc, what are actually the control group, experimental groups?
There is no control or experimental group in this PR. It provides a benchmark tool for evaluating both column-based and row-based rowset for the access from V5 and V6 above. In the coming-up experiments, the benchmark will be run on the base version and different improvement implementations for comparison.
And we could decouple it with Spark's utils and move it to kyuubi-util module for a general light-weight benchmark kit in the future. And when it's ready to integrate JMH in Kyuubi with Maven + sbt + Scala, this benchmark toolkit is able to be removed.
In the coming-up experiments
If there are a bunch of PRs, I suggest you create an umbrella, and an KPIP(discuss/vote in the dev list) is necessary for introduce a benchmarking framework.
And based on your screenshots in your PR desc, what are actually the control group, experimental groups?
As I understand, Behavior Without This Pull Request
is the control group and Behavior With This Pull Request
is experimental group.
I will keep my -1 as the testing purpose here is not clear
I think the testing purpose is clear, this is for benchmarking the conversion performance from rows
of sql result to TRowSet
.
@wForget
You are correct about the purpose of this PR, but the benchmark itself needs to be corrected. Technically, if we introduce the Spark benchmark tool, the first line of results in each single benchmark should be the control group as it always produces 1x for Relative
.
The current test also varies the simple rule of univariate analysis
for experiments. What I saw in the results is a mess, TBH.
You are correct about the purpose of this PR, but the benchmark itself needs to be corrected. Technically, if we introduce the Spark benchmark tool, the first line of results in each single benchmark should be the control group as it always produces 1x for
Relative
.The current test also varies the simple rule of
univariate analysis
for experiments. What I saw in the results is a mess, TBH.
Got it, thank you for your explanation.
Closing this PR with no enough consensus on the purposes, the design, the changes and the approaches.
In the coming-up experiments
If there are a bunch of PRs, I suggest you create an umbrella, and an KPIP(discuss/vote in the dev list) is necessary for introduce a benchmarking framework.
I'm strongly against your comment here. First, the umbrella issue is created for the whole task list that is still extendable, Second, you did not allow me to use the test-jars of Spark for using existed benchmark kit, unintentionally or intentionally ignoring that several benchmark tests have already introduced on it . Third, you told me to raise a KPIP for such a duplicated framework from a copied implementation. I respect all your comments but I just extremely unwillingly to see every and every and every effort in resolving this problem has been deliberately disregarded and pulled back a meter back for a inch forward. I did no evil and did not violate any community code of conduct now and ever! WHY make it difficult for me !!!
I'm strongly against your comment here. ... I respect all your comments
Hi @bowenliang123. First thing first, calm down.
I want to clarify that I am a regular contributor/PMC member of Apache Kyuubi, just like everyone else. My comments on this PR are simply my personal opinion. I have left a veto with explanations, which also have been challenged and discussed.
I did no evil and did not violate any community code of conduct now and ever! WHY make it difficult for me !!!
I know you well in person. You and nobody else don't violate CoC in this PR.
Third, you told me to raise a KPIP for such a duplicated framework from a copied implementation.
When in doubt, if a committer thinks a change needs a KPIP, it does.
--- https://kyuubi.apache.org/improvement-proposals.html