[Umbrella] Kyuubi Spark TPC-DS Connector
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
Describe the proposal
The Spark DataSource V2 API [1] has been available since Spark 3.0. It provides a set of APIs for developers to implement a connector, and Spark exposes the connector through the SQL and DataFrame APIs automatically with only a few configurations.
The TPC-DS [2] dataset is very useful for benchmarking and demonstration. Previously, we needed to generate the dataset with dsdgen or the kyuubi-tpcds tool before running queries. With the connector proposed by this PR, users only need to:
- Add jar: kyuubi-spark-connector-tpcds_2.12-${kyuubi_version}.jar
- Add conf: spark.sql.catalog.tpcds=org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog
Then they can query the TPC-DS tables at different scales under the tpcds.sf{scale} database. For instance,
0: jdbc:hive2://0.0.0.0:10009/> show tables in tpcds.sf1;
+------------+-------------------------+--------------+
| namespace | tableName | isTemporary |
+------------+-------------------------+--------------+
| sf1 | call_center | false |
| sf1 | catalog_page | false |
| sf1 | catalog_returns | false |
| sf1 | catalog_sales | false |
| sf1 | customer | false |
| sf1 | customer_address | false |
| sf1 | customer_demographics | false |
| sf1 | date_dim | false |
| sf1 | household_demographics | false |
| sf1 | income_band | false |
| sf1 | inventory | false |
| sf1 | item | false |
| sf1 | promotion | false |
| sf1 | reason | false |
| sf1 | ship_mode | false |
| sf1 | store | false |
| sf1 | store_returns | false |
| sf1 | store_sales | false |
| sf1 | time_dim | false |
| sf1 | warehouse | false |
| sf1 | web_page | false |
| sf1 | web_returns | false |
| sf1 | web_sales | false |
| sf1 | web_site | false |
+------------+-------------------------+--------------+
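The two setup steps and the tpcds.sf{scale} naming can be sketched as a small helper that assembles a `spark-sql` invocation. This is only an illustration: the function names (`connector_jar`, `qualified_table`, `spark_sql_command`), the jar path, and the `X.Y.Z` version are hypothetical placeholders, while the jar name pattern and the catalog class are taken from the proposal above. `--jars`, `--conf`, and `-e` are standard `spark-sql` options.

```python
import shlex
from typing import List

# Catalog implementation class quoted in the proposal.
CATALOG_CLASS = "org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog"


def connector_jar(kyuubi_version: str, scala_version: str = "2.12") -> str:
    """Jar file name following the pattern given in the proposal."""
    return f"kyuubi-spark-connector-tpcds_{scala_version}-{kyuubi_version}.jar"


def qualified_table(table: str, scale: int, catalog: str = "tpcds") -> str:
    """Build <catalog>.sf<scale>.<table>, e.g. tpcds.sf1.store_sales."""
    return f"{catalog}.sf{scale}.{table}"


def spark_sql_command(jar_path: str, query: str, catalog: str = "tpcds") -> List[str]:
    """Assemble argv for spark-sql with the connector jar and catalog conf."""
    return [
        "spark-sql",
        "--jars", jar_path,
        "--conf", f"spark.sql.catalog.{catalog}={CATALOG_CLASS}",
        "-e", query,
    ]


if __name__ == "__main__":
    # "X.Y.Z" and "/path/to" are placeholders, not real values.
    jar = "/path/to/" + connector_jar("X.Y.Z")
    query = f"SELECT count(*) FROM {qualified_table('store_sales', 1)}"
    # Print a shell-quoted command line that could be pasted into a terminal.
    print(shlex.join(spark_sql_command(jar, query)))
```

The same `spark.sql.catalog.tpcds` entry could presumably be placed in `spark-defaults.conf` instead of being passed with `--conf`, so that every session started through Kyuubi sees the catalog.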
[1] https://github.com/apache/spark/tree/v3.2.1/sql/catalyst/src/main/java/org/apache/spark/sql/connector
[2] https://tpc.org/TPC_Documents_Current_Versions/pdf/TPC-DS_v3.2.0.pdf
Task list
- https://github.com/apache/incubator-kyuubi/pull/2531
- https://github.com/apache/incubator-kyuubi/issues/2539
- https://github.com/apache/incubator-kyuubi/issues/2540
- https://github.com/apache/incubator-kyuubi/issues/2541
- https://github.com/apache/incubator-kyuubi/issues/2542
- https://github.com/apache/incubator-kyuubi/issues/2543
- https://github.com/apache/incubator-kyuubi/issues/2553
- https://github.com/apache/incubator-kyuubi/pull/2673
- https://github.com/apache/incubator-kyuubi/issues/2679
- https://github.com/apache/incubator-kyuubi/pull/2700
- https://github.com/apache/incubator-kyuubi/pull/2701
- https://github.com/apache/incubator-kyuubi/pull/2702
- https://github.com/apache/incubator-kyuubi/issues/2704
- https://github.com/apache/incubator-kyuubi/pull/2709
- https://github.com/apache/incubator-kyuubi/pull/2729
- https://github.com/apache/incubator-kyuubi/issues/2759
- https://github.com/apache/incubator-kyuubi/pull/2777
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
I'm interested in this. Could you assign me some tasks?
Thanks @wForget. Since https://github.com/apache/incubator-kyuubi/pull/2531 has not been merged yet, you can start by reviewing it.
@pan3793 I’m interested in working on the subtasks in this umbrella as well. I can start with #2553. Could you assign the subtask to me if no one is working on it?
Thx, assigned.