
[Umbrella] Kyuubi Spark TPC-DS Connector

Open pan3793 opened this issue 3 years ago • 4 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the proposal

The Spark DataSource V2 API[1] has been available since Spark 3.0. It provides a set of APIs for developers to implement a connector, and Spark exposes the connector to the SQL/DataFrame API automatically with little configuration.

The TPC-DS[2] dataset is very useful for benchmarking and demonstration. Previously, we needed to generate the dataset with dsdgen or the kyuubi-tpcds tool before running queries. With the connector proposed by this PR, users just need to:

  1. Add jar kyuubi-spark-connector-tpcds_2.12-${kyuubi_version}.jar
  2. Add conf spark.sql.catalog.tpcds=org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog

They can then query the TPC-DS tables at different scale factors under the tpcds.sf{scale} database. For instance,

0: jdbc:hive2://0.0.0.0:10009/> show tables in tpcds.sf1;
+------------+-------------------------+--------------+
| namespace  |        tableName        | isTemporary  |
+------------+-------------------------+--------------+
| sf1        | call_center             | false        |
| sf1        | catalog_page            | false        |
| sf1        | catalog_returns         | false        |
| sf1        | catalog_sales           | false        |
| sf1        | customer                | false        |
| sf1        | customer_address        | false        |
| sf1        | customer_demographics   | false        |
| sf1        | date_dim                | false        |
| sf1        | household_demographics  | false        |
| sf1        | income_band             | false        |
| sf1        | inventory               | false        |
| sf1        | item                    | false        |
| sf1        | promotion               | false        |
| sf1        | reason                  | false        |
| sf1        | ship_mode               | false        |
| sf1        | store                   | false        |
| sf1        | store_returns           | false        |
| sf1        | store_sales             | false        |
| sf1        | time_dim                | false        |
| sf1        | warehouse               | false        |
| sf1        | web_page                | false        |
| sf1        | web_returns             | false        |
| sf1        | web_sales               | false        |
| sf1        | web_site                | false        |
+------------+-------------------------+--------------+
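The two setup steps and the query above can be sketched as a small helper that assembles a `spark-sql` command line. This is only an illustration, not part of the proposal: the helper names, the `/path/to` directory, and the `<kyuubi_version>` placeholder are assumptions, and it assumes the jar is passed with `--jars` and the catalog with `--conf`.

```python
import shlex
from typing import List

# Catalog implementation class named in step 2 of the proposal.
CATALOG_CLASS = "org.apache.kyuubi.spark.connector.tpcds.TPCDSCatalog"


def connector_jar(kyuubi_version: str, scala_version: str = "2.12") -> str:
    """Build the jar file name following the pattern in step 1."""
    return f"kyuubi-spark-connector-tpcds_{scala_version}-{kyuubi_version}.jar"


def show_tables_query(scale: int, catalog: str = "tpcds") -> str:
    """Build the query that lists tables in the sf{scale} database."""
    return f"SHOW TABLES IN {catalog}.sf{scale}"


def spark_sql_args(jar_path: str, query: str, catalog: str = "tpcds") -> List[str]:
    """Assemble spark-sql arguments that register the catalog and run a query."""
    return [
        "spark-sql",
        "--jars", jar_path,
        "--conf", f"spark.sql.catalog.{catalog}={CATALOG_CLASS}",
        "-e", query,
    ]


if __name__ == "__main__":
    # Hypothetical location and placeholder version; substitute real values.
    jar = "/path/to/" + connector_jar("<kyuubi_version>")
    print(shlex.join(spark_sql_args(jar, show_tables_query(1))))
```

The catalog name `tpcds` is just the key used in `spark.sql.catalog.tpcds`; choosing a different key would change the prefix used in queries accordingly.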

[1] https://github.com/apache/spark/tree/v3.2.1/sql/catalyst/src/main/java/org/apache/spark/sql/connector
[2] https://tpc.org/TPC_Documents_Current_Versions/pdf/TPC-DS_v3.2.0.pdf

Task list

  • https://github.com/apache/incubator-kyuubi/pull/2531
  • https://github.com/apache/incubator-kyuubi/issues/2539
  • https://github.com/apache/incubator-kyuubi/issues/2540
  • https://github.com/apache/incubator-kyuubi/issues/2541
  • https://github.com/apache/incubator-kyuubi/issues/2542
  • https://github.com/apache/incubator-kyuubi/issues/2543
  • https://github.com/apache/incubator-kyuubi/issues/2553
  • https://github.com/apache/incubator-kyuubi/pull/2673
  • https://github.com/apache/incubator-kyuubi/issues/2679
  • https://github.com/apache/incubator-kyuubi/pull/2700
  • https://github.com/apache/incubator-kyuubi/pull/2701
  • https://github.com/apache/incubator-kyuubi/pull/2702
  • https://github.com/apache/incubator-kyuubi/issues/2704
  • https://github.com/apache/incubator-kyuubi/pull/2709
  • https://github.com/apache/incubator-kyuubi/pull/2729
  • https://github.com/apache/incubator-kyuubi/issues/2759
  • https://github.com/apache/incubator-kyuubi/pull/2777

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

pan3793, May 02 '22 11:05

I'm interested in this. Could you assign me some tasks?

wForget, May 03 '22 02:05

Thanks @wForget. Since https://github.com/apache/incubator-kyuubi/pull/2531 has not been merged yet, you can start by reviewing it.

pan3793, May 03 '22 05:05

@pan3793 I’m interested in working on the subtasks in this umbrella as well. I can start with #2553. Could you assign the subtask to me if no one is working on it?

yihua, May 08 '22 20:05

Thx, assigned.

pan3793, May 09 '22 00:05