tpch-spark
TPC-H queries implemented in Spark using the DataFrames API.
The TPC-H is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions. — https://www.tpc.org/tpch
Getting started
Prerequisites: Apache Spark
tpch-spark requires that Spark is installed on your machine. You can download Spark from https://spark.apache.org/downloads.html. At a high level, to install Spark you have to:
# Step 1:
tar xvfz <the tgz file you downloaded>
# Step 2:
# [optionally move/rename the untarred directory wherever you want, say, $HOME/spark]
# Step 3:
export PATH=$PATH:$HOME/spark/bin
# or better yet, add the above to your bashrc (or equivalent) and source it.
A. Get the code
git clone https://github.com/ssavvides/tpch-spark
cd tpch-spark
B. Generate input data tables
Navigate to the data generator directory dbgen and build the data generator:
cd dbgen
make
This should generate an executable called dbgen. Use the -h flag to see the various options the tool offers.
./dbgen -h
The simplest case is running the dbgen executable with no flags.
./dbgen
The above generates tables with extension .tbl at scale factor 1 (the default), totaling roughly 1GB across all tables.
$ ls -hl *.tbl
-rw-rw-r-- 1 savvas savvas 24M May 28 12:39 customer.tbl
-rw-rw-r-- 1 savvas savvas 725M May 28 12:39 lineitem.tbl
-rw-rw-r-- 1 savvas savvas 2.2K May 28 12:39 nation.tbl
-rw-rw-r-- 1 savvas savvas 164M May 28 12:39 orders.tbl
-rw-rw-r-- 1 savvas savvas 114M May 28 12:39 partsupp.tbl
-rw-rw-r-- 1 savvas savvas 24M May 28 12:39 part.tbl
-rw-rw-r-- 1 savvas savvas 389 May 28 12:39 region.tbl
-rw-rw-r-- 1 savvas savvas 1.4M May 28 12:39 supplier.tbl
To generate tables of a different size, use the -s (scale factor) option. For example,
./dbgen -s 10
will generate roughly 10GB of input data.
Note that by default, dbgen uses | as the column separator and also appends a | at the end of each line.
$ cat region.tbl
0|AFRICA|lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to |
1|AMERICA|hs use ironic, even requests. s|
2|ASIA|ges. thinly even pinto beans ca|
3|EUROPE|ly final courts cajole furiously final excuse|
4|MIDDLE EAST|uickly special accounts cajole carefully blithely close requests. carefully final asymptotes haggle furiousl|
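Because of that trailing separator, the last column is followed by an empty field; in Scala (and Java), String.split drops trailing empty strings by default, so a plain regex split handles these rows cleanly. A minimal sketch of parsing one such row (a hypothetical illustration, not the repository's actual parsing code):

```scala
// Hypothetical sketch: splitting one pipe-delimited TPC-H row.
// Java/Scala String.split drops trailing empty strings, so the
// trailing '|' does not produce a spurious empty column.
object TblRowDemo {
  def parseRow(line: String): Array[String] =
    line.split("\\|") // '|' is a regex metacharacter, so it must be escaped

  def main(args: Array[String]): Unit = {
    val fields = parseRow("2|ASIA|ges. thinly even pinto beans ca|")
    println(fields.length)         // 3, not 4
    println(fields.mkString(" / "))
  }
}
```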
You can find the schemas of the generated tables in the TPC-H specification.
C. Build tpch-spark
tpch-spark is written in Scala as a self-contained Spark application.
Use the provided sbt build file to package it:
cd tpch-spark
sbt package
The above command packages the application into a jar file, e.g., target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar, which you will need in the next step.
D. Run tpch-spark
You can run all TPC-H queries from Q01 to Q22 by running:
spark-submit --class "main.scala.TpchQuery" target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar
If you want to run a specific query, use
spark-submit --class "main.scala.TpchQuery" target/scala-2.12/spark-tpc-h-queries_2.12-1.0.jar <query number>
where <query number> is the number of the query to run, i.e., 1, 2, ..., 22.
N.B.: By default, tpch-spark looks for the input data files (the *.tbl files generated by dbgen) in "<current working directory>/dbgen"; you can point it to another location by setting the environment variable TPCH_INPUT_DATA_DIR. Query results are stored in "${TPCH_INPUT_DATA_DIR}/output/{Q01, Q02, ...}" by default, or in whatever location TPCH_QUERY_OUTPUT_DIR specifies. The execution time of each query run is stored in "<current working directory>/tpch_execution_times.txt", or in whatever file path TPCH_EXECUTION_TIMES points to.
For example, to replace the default locations you can use:
export TPCH_INPUT_DATA_DIR="$HOME/tpch-data"
export TPCH_QUERY_OUTPUT_DIR="$HOME/tpch-results"
export TPCH_EXECUTION_TIMES="$HOME/tpch-times.txt"
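The default-location rules above can be sketched as follows. This is a hypothetical illustration of the lookup order, not the application's actual code; the object and method names are made up:

```scala
// Hypothetical sketch of the default-location rules described above.
// Each path falls back to its documented default when the corresponding
// environment variable is unset.
object TpchPaths {
  // env: the process environment (e.g. sys.env); cwd: current working directory
  def inputDir(env: Map[String, String], cwd: String): String =
    env.getOrElse("TPCH_INPUT_DATA_DIR", s"$cwd/dbgen")

  def outputDir(env: Map[String, String], cwd: String): String =
    env.getOrElse("TPCH_QUERY_OUTPUT_DIR", s"${inputDir(env, cwd)}/output")

  def timesFile(env: Map[String, String], cwd: String): String =
    env.getOrElse("TPCH_EXECUTION_TIMES", s"$cwd/tpch_execution_times.txt")
}
```

With no variables set, the input directory resolves to "<cwd>/dbgen" and results land under it in "output/", matching the defaults described above.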
Other Implementations
- Data generator (http://www.tpc.org/tpch/)
- TPC-H for Hive (https://issues.apache.org/jira/browse/hive-600)
- TPC-H for PIG (https://github.com/ssavvides/tpch-pig)
Contact
Savvas Savvides [email protected]