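Using TIS data-flow analysis (EMR) by building wide tables on the TPC-H dataset
The data-flow analysis (EMR) feature of TIS is now basically usable. It should be exercised against TPC-H (https://www.tpc.org/tpch/default5.asp): data-flow analysis (EMR) builds offline T+1 wide tables for business systems to consume.
Related documents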
- https://help.aliyun.com/zh/hologres/user-guide/test-plan#li-fiz-p0h-7uq
- https://www.tpc.org/tpch/default5.asp
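Building wide tables
Below are two examples of Hive wide tables built on the TPC-H dataset, covering typical business analysis scenarios such as order analysis and supply chain analysis. They assume the original TPC-H tables are already stored in Hive, either as external tables or through an import.
Example 1: order-customer-geography wide table
Goal
Combine the orders, lineitem, customer, nation, and region tables into one wide table for analyzing order details, customer information, and geographic distribution.
Hive SQL
-- Create the wide table (stored as ORC for better performance)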
CREATE TABLE IF NOT EXISTS order_customer_wide
STORED AS ORC
AS
SELECT
o.O_ORDERKEY AS order_key,
o.O_ORDERSTATUS AS order_status,
o.O_TOTALPRICE AS total_price,
o.O_ORDERDATE AS order_date,
o.O_ORDERPRIORITY AS order_priority,
l.L_PARTKEY AS part_key,
l.L_SUPPKEY AS supplier_key,
l.L_QUANTITY AS quantity,
l.L_EXTENDEDPRICE AS extended_price,
l.L_DISCOUNT AS discount,
l.L_TAX AS tax,
l.L_SHIPDATE AS ship_date,
c.C_NAME AS customer_name,
c.C_ADDRESS AS customer_address,
c.C_PHONE AS customer_phone,
c.C_ACCTBAL AS account_balance,
c.C_MKTSEGMENT AS market_segment,
n.N_NAME AS nation_name,
r.R_NAME AS region_name
FROM
orders o
JOIN
lineitem l ON o.O_ORDERKEY = l.L_ORDERKEY
JOIN
customer c ON o.O_CUSTKEY = c.C_CUSTKEY
JOIN
nation n ON c.C_NATIONKEY = n.N_NATIONKEY
JOIN
region r ON n.N_REGIONKEY = r.R_REGIONKEY;
Wide table fields
- Core order fields: order_key, order_status, total_price, order_date
- Line item details: part_key, supplier_key, quantity, extended_price, and so on
- Customer information: customer_name, account_balance, market_segment
- Geographic information: nation_name, region_name
Typical analysis scenarios
- Total order amount by region:
    SELECT region_name, SUM(total_price) FROM order_customer_wide GROUP BY region_name
- Customer segmentation (market segment + region):
    SELECT market_segment, region_name, COUNT(DISTINCT customer_name) FROM ... GROUP BY ...
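One caveat when aggregating over order_customer_wide: the table has one row per line item, so total_price repeats on every line of the same order and a plain SUM counts it several times. The sketch below uses made-up rows (the values and the helper names are illustrative assumptions, not TPC-H data) to show the overcount and one way around it, de-duplicating on order_key before summing.

```shell
#!/bin/bash
# Hypothetical sample rows: order_key|total_price|region_name|extended_price
# Order 1 has two line items, so its total_price (300) appears twice.
sample_rows() {
  printf '%s\n' \
    '1|300|ASIA|100' \
    '1|300|ASIA|200' \
    '2|50|ASIA|50'
}

# Naive sum: adds total_price once per line item, which overcounts.
naive_total() {
  awk -F'|' '{ sum += $2 } END { print sum }'
}

# De-duplicated sum: counts the total_price of each order_key only once.
dedup_total() {
  awk -F'|' '!seen[$1]++ { sum += $2 } END { print sum }'
}

echo "naive: $(sample_rows | naive_total)"
echo "dedup: $(sample_rows | dedup_total)"
```

In Hive the same idea would be to aggregate over the distinct (order_key, total_price, region_name) combinations rather than over every row of the wide table.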
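Example 2: supply chain-part-supplier wide table
Goal
Combine the part, partsupp, supplier, nation, and region tables for analyzing part supply cost and the geographic distribution of suppliers.
Hive SQL
-- Create the wide table (partitioned by p_type to speed up queries)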
CREATE TABLE IF NOT EXISTS supply_chain_wide
PARTITIONED BY (p_type) -- CTAS partitioning needs Hive 3.2.0+; the column is named without a type
STORED AS ORC
AS
SELECT
p.P_PARTKEY AS part_key,
p.P_NAME AS part_name,
p.P_MFGR AS manufacturer,
p.P_BRAND AS brand,
p.P_SIZE AS size,
p.P_CONTAINER AS container,
p.P_RETAILPRICE AS retail_price,
ps.PS_SUPPKEY AS supplier_key,
ps.PS_AVAILQTY AS available_quantity,
ps.PS_SUPPLYCOST AS supply_cost,
s.S_NAME AS supplier_name,
s.S_ADDRESS AS supplier_address,
s.S_PHONE AS supplier_phone,
n.N_NAME AS supplier_nation,
r.R_NAME AS supplier_region,
p.P_TYPE AS p_type -- used as the partition column
FROM
part p
JOIN
partsupp ps ON p.P_PARTKEY = ps.PS_PARTKEY
JOIN
supplier s ON ps.PS_SUPPKEY = s.S_SUPPKEY
JOIN
nation n ON s.S_NATIONKEY = n.N_NATIONKEY
JOIN
region r ON n.N_REGIONKEY = r.R_REGIONKEY;
Wide table fields
- Part information: part_key, part_name, manufacturer, retail_price
- Supply relationship: supplier_key, available_quantity, supply_cost
- Supplier geography: supplier_nation, supplier_region
- Partition column: p_type (partitioned by part type to speed up queries)
Typical analysis scenarios
- Average supply cost by supplier region:
    SELECT supplier_region, AVG(supply_cost) FROM supply_chain_wide GROUP BY supplier_region
- Filtering high-cost parts (TPC-H supply costs range from 1.00 to 1,000.00, so the threshold has to stay below 1000):
    SELECT part_name, supplier_name, supply_cost FROM supply_chain_wide WHERE supply_cost > 900
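Key design points
- Data redundancy versus query efficiency
  - A wide table stores redundant data to avoid JOINs at query time, which suits OLAP workloads, but the extra storage cost has to be weighed.
  - Use the ORC format with compression (such as SNAPPY) to improve storage footprint and query performance.
- Partitioning strategy
  - Partition by columns that are filtered on frequently (such as p_type or order_date) to speed up queries.
- Column naming conventions
  - Add a prefix to columns that would otherwise share a name (for example, N_NAME becomes supplier_nation) to avoid ambiguity.
- Data consistency
  - Make sure foreign key relationships in the source tables are correct (for example, every partsupp row must reference a valid supplier).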
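Before settling on a partition column it helps to know how many distinct values it holds, since every value becomes a partition and Hive limits how many dynamic partitions a job may create (see hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode). A minimal sketch, assuming the raw part.tbl from dbgen is at hand and that P_TYPE is its fifth column; the helper name distinct_count and the sample rows are made up for illustration.

```shell
#!/bin/bash
# Count the distinct values of one '|'-delimited column in a .tbl file.
# Usage: distinct_count FILE COLUMN_INDEX   (1-based column index)
distinct_count() {
  cut -d'|' -f"$2" "$1" | sort -u | wc -l | tr -d ' '
}

# Tiny stand-in for part.tbl (made-up rows; P_TYPE is the 5th column).
demo=$(mktemp)
printf '%s\n' \
  '1|name a|Manufacturer#1|Brand#13|PROMO BURNISHED COPPER|7|JUMBO PKG|901.00|c|' \
  '2|name b|Manufacturer#1|Brand#13|LARGE BRUSHED BRASS|1|LG CASE|902.00|c|' \
  '3|name c|Manufacturer#4|Brand#42|PROMO BURNISHED COPPER|21|WRAP CASE|903.00|c|' \
  > "$demo"

echo "distinct p_type values: $(distinct_count "$demo" 5)"
rm -f "$demo"
```

On real data, distinct_count part.tbl 5 gives the number of partitions supply_chain_wide would end up with. If that number is large, the Hive limits may need raising or a coarser partition column may be a better fit.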
Notes
- Data generation and import
  - If the TPC-H data lives in MySQL, export it to CSV first, then load it into Hive with LOAD DATA, or with hdfs put plus an external table.
- Performance tuning
  - Adjust Hive parameters: set hive.exec.parallel=true; (parallel execution)
  - Enable MapJoin for joins of large tables against small ones: set hive.auto.convert.join=true;
- Updating the wide tables
  - TPC-H is a static dataset, so no updates are needed. If incremental updates are required, Hive transactional tables (ACID) can be used.
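The "hdfs put plus an external table" route mentioned in the notes can be sketched as a small generator that prints the Hive DDL for one pipe-delimited table. This is only an illustration: the helper name external_table_ddl, the HDFS path /data/tpch/region, and the Hive column types chosen for region are assumptions, not something the TPC-H kit prescribes.

```shell
#!/bin/bash
# Print Hive DDL for an external table over '|'-delimited text files.
# Arguments: table name, HDFS directory, column definitions.
external_table_ddl() {
  local table="$1" location="$2" columns="$3"
  cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS ${table} (
  ${columns}
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '${location}';
EOF
}

# Hypothetical HDFS directory; adjust it to your cluster.
external_table_ddl region /data/tpch/region \
  "R_REGIONKEY INT, R_NAME STRING, R_COMMENT STRING"
```

The printed statement can be saved to a file and run with hive -f once the data file has been uploaded to the same directory, for example with hdfs dfs -put region.tbl /data/tpch/region/.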
Together, these two wide tables cover about 80% of the typical TPC-H analysis scenarios while reducing the performance overhead of complex JOINs.
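Installing TPC-H on MySQL
Generating MySQL tables with TPC-H takes two parts: generating the test data and importing it into MySQL.
I. Generating the TPC-H test data
The detailed steps for building a MySQL test database from TPC-H are as follows.
1. Download and unpack the TPC-H tool package
The TPC site serves the package through a download form, so the URL below may return an HTML page rather than a ZIP archive. If unzip fails, download the package in a browser first.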
wget https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp -O TPC-H.zip
unzip TPC-H.zip -d TPC-H
cd TPC-H
2. Build the data generator, dbgen
- Install the dependencies:
    # Ubuntu/Debian
    sudo apt-get install build-essential gcc make
    # CentOS/RHEL
    sudo yum install gcc make
- Build dbgen:
    cd dbgen
    cp makefile.suite makefile
  Edit makefile and set the following:
    CC = gcc
    DATABASE = MYSQL
    MACHINE = LINUX
    WORKLOAD = TPCH
  Then build:
    make
The build fails
Running make in the dbgen directory of TPC-H V3.0.1 reports the following errors:
qgen.c: In function 'qsub':
qgen.c:175:22: error: 'SET_ROWCOUNT' undeclared (first use in this function)
fprintf(ofp, SET_ROWCOUNT, rowcnt);
^
qgen.c:175:22: note: each undeclared identifier is reported only once for each function it appears in
qgen.c:191:45: error: 'START_TRAN' undeclared (first use in this function)
fprintf(ofp,"%s\n", START_TRAN);
^
qgen.c:197:38: error: 'SET_DBASE' undeclared (first use in this function)
fprintf(ofp, SET_DBASE, db_name);
^
qgen.c:203:45: error: 'END_TRAN' undeclared (first use in this function)
fprintf(ofp,"%s\n", END_TRAN);
^
qgen.c:218:54: error: 'SET_OUTPUT' undeclared (first use in this function)
fprintf(ofp,"%s '%s/%s.%d'", SET_OUTPUT, osuff,
^
qgen.c:235:46: error: 'GEN_QUERY_PLAN' undeclared (first use in this function)
fprintf(ofp, "%s\n", GEN_QUERY_PLAN);
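Cause
qgen relies on database-specific macros (SET_ROWCOUNT, START_TRAN, and so on) when it generates queries, and none of them are defined for the MySQL configuration. The macros adapt the output to each database's syntax (for example, SQL Server transaction commands). MySQL does not need them, so they have to be disabled or the code adjusted by hand.
Solution
Step 1: edit the tpcd.h header
Locate tpcd.h in the dbgen directory and add the missing macro definitions:
cd TPC-H/dbgen
vim tpcd.h # or any other editor
Append the following to the end of the file:
/* MySQL does not need these macros, so define them as empty strings */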
#define SET_ROWCOUNT ""
#define START_TRAN ""
#define SET_DBASE ""
#define END_TRAN ""
#define SET_OUTPUT ""
#define GEN_QUERY_PLAN ""
Save and exit.
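The manual edit can also be scripted so that it is safe to run more than once. The sketch below is one possible approach, not part of the TPC-H kit: the function name patch_tpcd_header and the marker comment are made up, and each stub is wrapped in #ifndef so that a header which already defines a macro for the selected database is left alone.

```shell
#!/bin/bash
# Append empty stubs for the macros qgen.c expects.
# Usage: patch_tpcd_header PATH_TO_TPCD_H
patch_tpcd_header() {
  local header="$1" marker='/* qgen stubs for MySQL */' macro
  # Do nothing if a previous run already appended the stubs.
  grep -qF -- "$marker" "$header" && return 0
  {
    echo "$marker"
    for macro in SET_ROWCOUNT START_TRAN SET_DBASE END_TRAN SET_OUTPUT GEN_QUERY_PLAN; do
      # The #ifndef guard keeps any definition the header already provides.
      printf '#ifndef %s\n#define %s ""\n#endif\n' "$macro" "$macro"
    done
  } >> "$header"
}

# Patch tpcd.h only when run from the dbgen directory.
if [ -f tpcd.h ]; then
  patch_tpcd_header tpcd.h
fi
```

Running the script a second time leaves the file unchanged, because the marker comment is found and the function returns early.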
Step 2: modify qgen.c (optional)
If the errors persist, comment out the offending lines, for example:
// Find the following lines in qgen.c and comment them out:
// fprintf(ofp, SET_ROWCOUNT, rowcnt); // near line 175
// fprintf(ofp,"%s\n", START_TRAN); // near line 191
// fprintf(ofp, SET_DBASE, db_name); // near line 197
// fprintf(ofp,"%s\n", END_TRAN); // near line 203
// fprintf(ofp,"%s '%s/%s.%d'", SET_OUTPUT, osuff, ...); // near line 218
// fprintf(ofp, "%s\n", GEN_QUERY_PLAN); // near line 235
Step 3: rebuild
Clean the previous build output and rebuild:
make clean
make
3. Generate the test data
- Generate the data files (.tbl):
    # Generate 1 GB of data (the -s parameter controls the size; -s 10 generates 10 GB)
    ./dbgen -s 1 -f
  This produces customer.tbl, orders.tbl, lineitem.tbl, and the other table files.
4. Create the MySQL database
# Log in to MySQL
mysql -u root -p
-- Create the database
CREATE DATABASE tpch;
USE tpch;
5. Create the table schema
- Take the DDL script (dss.ddl) from the TPC-H tool package and adapt it to MySQL syntax:
    CREATE TABLE nation (
      n_nationkey INTEGER PRIMARY KEY,
      n_name CHAR(25),
      n_regionkey INTEGER,
      n_comment VARCHAR(152)
    );
    -- Create the other tables the same way (region, part, supplier, partsupp, customer, orders, lineitem)
6. Convert and import the data
- Convert the .tbl files into a MySQL-compatible format:
    sed -i 's/|$//' *.tbl # remove the trailing | delimiter from every line
- Import the data with LOAD DATA:
    -- Example: import the nation table
    LOAD DATA LOCAL INFILE 'nation.tbl' INTO TABLE nation FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
    -- Repeat for the other tables
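To see what the sed expression in step 6 does, it can be tried on a single line before touching the real files. The sample row below is made up, and strip_trailing_pipe is a hypothetical helper name. Only the final | of a line is removed, while the field separators inside the line stay in place.

```shell
#!/bin/bash
# Remove one trailing '|' from every line read on standard input.
strip_trailing_pipe() {
  sed 's/|$//'
}

# Made-up row in the dbgen layout: fields separated by '|', plus a trailing '|'.
printf '0|ALGERIA|0|sample comment|\n' | strip_trailing_pipe
```

Unlike sed -i, this form writes to standard output and leaves the input untouched, which makes it convenient for a dry run, for example strip_trailing_pipe < nation.tbl | head.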
7. Add indexes and constraints
-- Example: add a primary key to the orders table
ALTER TABLE orders ADD PRIMARY KEY (o_orderkey);
-- Add a foreign key to lineitem
ALTER TABLE lineitem
ADD FOREIGN KEY (l_orderkey) REFERENCES orders(o_orderkey);
8. Verify data integrity
-- Check the table row count
SELECT COUNT(*) FROM lineitem;
-- Expected result: 6,001,215 rows (with -s 1)
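The same check can be done on the generated files before they are imported. The sketch below compares the line count of each .tbl file with the row counts that TPC-H defines per scale factor. The helper names are made up, only integer scale factors are handled, and the lineitem count is listed only for scale factor 1, because it is not an exact multiple of the scale factor.

```shell
#!/bin/bash
# Expected row count of a TPC-H table at an integer scale factor.
# region and nation have a fixed size; the other tables scale linearly.
expected_rows() {
  local table="$1" sf="$2"
  case "$table" in
    region)   echo 5 ;;
    nation)   echo 25 ;;
    supplier) echo $(( 10000 * sf )) ;;
    customer) echo $(( 150000 * sf )) ;;
    part)     echo $(( 200000 * sf )) ;;
    partsupp) echo $(( 800000 * sf )) ;;
    orders)   echo $(( 1500000 * sf )) ;;
    lineitem) if [ "$sf" = "1" ]; then echo 6001215; else echo unknown; fi ;;
    *)        echo unknown ;;
  esac
}

# Print actual versus expected row counts for the .tbl files
# in the current directory.
check_tbl_files() {
  local sf="$1" f table actual
  for f in *.tbl; do
    [ -e "$f" ] || continue
    table="${f%.tbl}"
    actual=$(wc -l < "$f" | tr -d ' ')
    echo "$table: actual=$actual expected=$(expected_rows "$table" "$sf")"
  done
}

check_tbl_files 1
```

Run it from the directory that holds the .tbl files. A mismatch at this stage points at data generation rather than at the MySQL import.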
Troubleshooting
- Permission problems:
    chmod +x dbgen
- MySQL secure mode restriction: add the following to my.cnf:
    [mysqld]
    secure_file_priv = ""
- Date format problems: change the date column types in dss.ddl to DATE.
9. Run the TPC-H queries
- Generate the queries with the qgen tool (it has to be built separately):
    ./qgen -s 1 > queries.sql
- Run the generated SQL in MySQL.
These steps yield a complete TPC-H test database. Adjust the -s parameter to generate datasets of different sizes according to the available hardware.
II. Creating the MySQL table schema
- Adapt the TPC-H DDL script
  - The default TPC-H DDL may contain syntax that MySQL does not accept (such as DISTRIBUTED BY). Modify dss.ddl, for example:
      CREATE TABLE CUSTOMER (
        C_CUSTKEY INTEGER NOT NULL,
        C_NAME VARCHAR(25) NOT NULL,
        C_ADDRESS VARCHAR(40) NOT NULL,
        C_NATIONKEY INTEGER NOT NULL,
        C_PHONE CHAR(15) NOT NULL,
        C_ACCTBAL DECIMAL(15,2) NOT NULL,
        C_MKTSEGMENT CHAR(10) NOT NULL,
        C_COMMENT VARCHAR(117) NOT NULL,
        PRIMARY KEY (C_CUSTKEY)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
  - Modify the other tables (orders, lineitem, part, supplier, partsupp, nation, region) the same way.
- Run the DDL in MySQL
    mysql -u root -p
    CREATE DATABASE tpch;
    USE tpch;
    SOURCE /path/to/modified_dss.ddl;
III. Importing the data into MySQL
- Prepare the .tbl files
  - Make sure the field delimiter is | and that lines have no trailing delimiter (LOAD DATA in MySQL requires this).
  - Optional: convert .tbl to .csv if needed:
      sed 's/|$//' customer.tbl > customer.csv
- Import with LOAD DATA
    -- Example: import the customer table
    LOAD DATA INFILE '/path/to/customer.tbl' INTO TABLE customer FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
  - Repeat this step for every table.
IV. Verifying the data
- Check the row count
    SELECT COUNT(*) FROM customer; -- expected result: 150,000 rows (with -s 1)
- Check the foreign key constraints
    SHOW CREATE TABLE orders; -- make sure foreign keys such as O_CUSTKEY reference the right table
Troubleshooting
- Permission problems
  - Make sure the MySQL user has the FILE privilege:
      GRANT FILE ON *.* TO 'user'@'localhost';
  - Start the MySQL client with --local-infile=1:
      mysql --local-infile=1 -u root -p
- Date format
  - If dates fail to load, use STR_TO_DATE:
      LOAD DATA INFILE '/path/to/orders.tbl' INTO TABLE orders FIELDS TERMINATED BY '|' (O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, ..., @O_ORDERDATE) SET O_ORDERDATE = STR_TO_DATE(@O_ORDERDATE, '%Y-%m-%d');
With these steps, the TPC-H tables can be created in MySQL and loaded with test data. For a smaller test dataset, lower the -s parameter (for example, -s 0.1 generates about 100 MB of data).
The import can be wrapped in a shell script so that it runs in one step:
#!/bin/bash
# Configuration (adjust to your environment)
DB_NAME="tpch" # database name
MYSQL_USER="root" # MySQL user name
TBL_DIR="./dbgen" # directory holding the .tbl files (the dbgen directory by default)
# 1. Clean up the .tbl files (remove the trailing |)
echo "[1/3] Cleaning up .tbl files..."
cd "$TBL_DIR" || exit 1
sed -i 's/|$//' *.tbl # remove the trailing | from every line
echo "✅ .tbl files cleaned up!"
# 2. Generate the MySQL LOAD DATA import script
echo "[2/3] Generating the MySQL import script..."
LOAD_SQL="load_data.sql"
cat << EOF > "$LOAD_SQL"
USE $DB_NAME;
-- Import tables in dependency order (referenced tables first, small tables before large ones)
LOAD DATA LOCAL INFILE 'region.tbl' INTO TABLE region FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'nation.tbl' INTO TABLE nation FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'part.tbl' INTO TABLE part FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'supplier.tbl' INTO TABLE supplier FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'partsupp.tbl' INTO TABLE partsupp FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'customer.tbl' INTO TABLE customer FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'orders.tbl' INTO TABLE orders FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INFILE 'lineitem.tbl' INTO TABLE lineitem FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
EOF
echo "✅ Import script generated: $LOAD_SQL"
# 3. Run the import
echo "[3/3] Importing data into MySQL (password required)..."
# -p with no value prompts for the password; "-p 123456" would be read as a database name
mysql -u "$MYSQL_USER" -p --local-infile=1 < "$LOAD_SQL" || exit 1
echo "✅ Data import complete!"
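The eight near-identical LOAD DATA lines can also be produced by a loop, which keeps the table order in one place. This is a sketch of an alternative, not a change to the script above; the names TABLES and gen_load_sql are made up, and the order assumes nation references region and lineitem references orders.

```shell
#!/bin/bash
# Tables in dependency order: referenced tables come before the tables
# that reference them (region before nation, orders before lineitem).
TABLES="region nation part supplier partsupp customer orders lineitem"

# Print a USE statement followed by one LOAD DATA statement per table.
gen_load_sql() {
  local db="$1" t
  echo "USE ${db};"
  for t in $TABLES; do
    # '\\n' inside double quotes yields a literal \n in the generated SQL.
    printf '%s\n' "LOAD DATA LOCAL INFILE '${t}.tbl' INTO TABLE ${t} FIELDS TERMINATED BY '|' LINES TERMINATED BY '\\n';"
  done
}

gen_load_sql tpch
```

Redirecting the output to a file, for example gen_load_sql tpch > load_data.sql, gives the same kind of import script the here-document builds.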