[Umbrella] Improvements and evaluation for TRowSet generation of Spark Engine
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
Describe the proposal
RowSet generation takes the results from the result iterator and serializes them into a column-based or row-based TRowSet. In most common cases it is the key factor for transport and performance.
- Possible drawbacks have been reported in looping over the result iterator through the wrapped stream in SparkOperation (https://github.com/apache/kyuubi/blob/master/externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/kyuubi/engine/spark/operation/SparkOperation.scala#L253C21-L253C26)
- The performance of Spark Engine's RowSet.toTRowSet should be evaluated by benchmarks, both overall and per data type, under the different modes (Thrift protocol version and Arrow-based)
- Performance Improvements in Spark Engine's RowSet implementation
- Code cleanup in Spark Engine's RowSet generation
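To make the iterator concern concrete, here is a minimal sketch that materializes the same synthetic rows as a lazy `Stream`, a `List`, and a `Vector`, then builds columns from them. It is not Kyuubi's implementation: `Row`, `sampleRows`, `toColumns`, and `timedMs` are hypothetical names, and the timer is far too crude to stand in for the benchmarks proposed here.

```scala
object IteratorMaterialization {
  // Hypothetical stand-in for a result row; this is not Spark's Row type.
  type Row = Seq[Any]

  // Generates n synthetic three-column rows.
  def sampleRows(n: Int): Iterator[Row] =
    Iterator.tabulate(n)(i => Seq[Any](i, s"name-$i", i * 1.5))

  // Column-based conversion: traverses the materialized rows once per column.
  def toColumns(rows: Seq[Row], numCols: Int): Seq[Seq[Any]] =
    (0 until numCols).map(c => rows.map(row => row(c)))

  // Crude wall-clock timer; a real benchmark needs warm-up and repetition.
  def timedMs[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  def main(args: Array[String]): Unit = {
    val n = 200000
    val candidates: Seq[(String, Iterator[Row] => Seq[Row])] = Seq(
      ("toStream", (it: Iterator[Row]) => it.toStream),
      ("toList", (it: Iterator[Row]) => it.toList),
      ("toVector", (it: Iterator[Row]) => it.toVector))
    candidates.foreach { case (name, materialize) =>
      val (cols, ms) = timedMs {
        // Force every column so that lazy collections are fully evaluated.
        toColumns(materialize(sampleRows(n)), 3).map(_.toVector)
      }
      println(s"$name: ${cols.head.size} rows per column in $ms ms")
    }
  }
}
```

All three variants produce the same columns; only the cost of materialization and repeated traversal differs, which is what the benchmarks should quantify.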
Task list
- [ ] Benchmark unit tests
- #5809
- [ ] Add benchmark unit test for RowSet generation covering supported data types
- [ ] Add a dedicated benchmark unit test for each supported data type for RowSet generation
- #5809
- [ ] Replace looping over the iterator via toSeq (toStream of Iterator) with an immutable collection
- [ ] Compare toStream/toSeq/toList/toVector
- ~~#5804~~
- [ ] Parallel processing for column-based TRowSet generation
- [ ] Performance improvements in data types
- #5811
- ~~DecimalType with column-based mode #5810~~
- ~~ArrayType of primitive data types with column-based mode~~
- Generalize TRowSet generator
- #5851
- #5861
- [ ] Code cleanup in RowSet of Spark Engine
- #5831
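For the parallel column-based generation task, one possible shape is to build each column in its own `Future` over rows that have already been materialized into a strict collection. This is only a sketch under assumptions: `Row`, `buildColumn`, and `toColumnsParallel` are hypothetical names, plain `Vector[Any]` columns stand in for the Thrift column types, and whether parallelism pays off at all is exactly what the benchmarks above are meant to show.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelColumns {
  // Hypothetical stand-in for a result row; this is not Spark's Row type.
  type Row = IndexedSeq[Any]

  // Extracts one column from rows that are already fully materialized.
  def buildColumn(rows: Vector[Row], ordinal: Int): Vector[Any] =
    rows.map(row => row(ordinal))

  // Builds every column concurrently; the result keeps column order.
  def toColumnsParallel(rows: Vector[Row], numCols: Int)(
      implicit ec: ExecutionContext): Vector[Vector[Any]] = {
    val pending = Vector.tabulate(numCols)(c => Future(buildColumn(rows, c)))
    // Blocking keeps the sketch short; the timeout is an arbitrary choice.
    Await.result(Future.sequence(pending), 1.minute)
  }

  def main(args: Array[String]): Unit = {
    import ExecutionContext.Implicits.global
    val rows: Vector[Row] = Vector.tabulate(4)(i => Vector[Any](i, s"v$i"))
    println(toColumnsParallel(rows, 2))
  }
}
```

The rows are shared read-only across the futures, which is safe only because they are an immutable, strict collection rather than a lazily evaluated stream over the iterator.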
Are you willing to submit PR?
- [X] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
- [ ] No. I cannot submit a PR at this time.