dolt icon indicating copy to clipboard operation
dolt copied to clipboard

Suboptimal join query plan for TPC-C "new order" transaction

Open andy-wm-arthur opened this issue 1 year ago • 2 comments

TPC-C is composed of 5 transaction types. The "new order" transaction contains a join query on the customer and warehouse tables:

SELECT c_discount, c_last, c_credit, w_tax 
FROM customer1, warehouse1 
WHERE w_id = 1 AND c_w_id = w_id AND c_d_id = 2 AND c_id = 2327

The schemas for the customer and warehouse tables are as follows:

% dolt schema export customer1
CREATE TABLE `customer1` (
  `c_id` int NOT NULL,
  `c_d_id` tinyint NOT NULL,
  `c_w_id` smallint NOT NULL,
  `c_first` varchar(16),
  `c_middle` char(2),
  `c_last` varchar(16),
  `c_street_1` varchar(20),
  `c_street_2` varchar(20),
  `c_city` varchar(20),
  `c_state` char(2),
  `c_zip` char(9),
  `c_phone` char(16),
  `c_since` datetime,
  `c_credit` char(2),
  `c_credit_lim` bigint,
  `c_discount` decimal(4,2),
  `c_balance` decimal(12,2),
  `c_ytd_payment` decimal(12,2),
  `c_payment_cnt` smallint,
  `c_delivery_cnt` smallint,
  `c_data` text,
  PRIMARY KEY (`c_w_id`,`c_d_id`,`c_id`),
  KEY `idx_customer1` (`c_w_id`,`c_d_id`,`c_last`,`c_first`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin;

% dolt schema export warehouse1
CREATE TABLE `warehouse1` (
  `w_id` smallint NOT NULL,
  `w_name` varchar(10),
  `w_street_1` varchar(20),
  `w_street_2` varchar(20),
  `w_city` varchar(20),
  `w_state` char(2),
  `w_zip` char(9),
  `w_tax` decimal(4,2),
  `w_ytd` decimal(12,2),
  PRIMARY KEY (`w_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin;

The join condition matches a single row in each table and the result set is also a single row:

+------------+-------------+----------+-------+
| c_discount | c_last      | c_credit | w_tax |
+------------+-------------+----------+-------+
| 0.46       | PRESABLEBAR | GC       | 0.12  |
+------------+-------------+----------+-------+

This join has the following query plan in v0.40.16

+---------------------------------------------------------------------------------------------------------------------+
| plan                                                                                                                |
+---------------------------------------------------------------------------------------------------------------------+
| Project(customer1.c_discount, customer1.c_last, customer1.c_credit, warehouse1.w_tax)                               |
|  └─ IndexedJoin(customer1.c_w_id = warehouse1.w_id)                                                                 |
|      ├─ IndexedTableAccess(warehouse1 on [warehouse1.w_id] with ranges: [{[1, 1]}])                                 |
|      └─ Filter((customer1.c_d_id = 2) AND (customer1.c_id = 2327))                                                  |
|          └─ IndexedTableAccess(customer1 on [customer1.c_w_id,customer1.c_d_id,customer1.c_last,customer1.c_first]) |
+---------------------------------------------------------------------------------------------------------------------+

This query plan has a number of issues that cause it to access much more data than is necessary to answer the query:

  • The plan chooses the secondary index idx_customer1 for table customer1 table, rather than the clustered index, which causes an indirect secondary index lookup rather than a covering index lookup.
  • The plan does not pushdown the projection into the IndexedTableAccess node, meaning that many extraneous fields are accessed. The TEXT field c_data is particularly expensive to access as it requires an additional chunk read.
  • The filter predicate customer1.c_id = 2327 is not pushed down to the IndexedTableAccess node. This is possibly a side effect of the wrong index being chosen, as this column is not present in the index key for idx_customer1. The inability to pushdown this predicate is likely the most expensive problem with this query plan as it increases the cardinality of this index access from 1 to 3000.

The above query can be rewritten as:

SELECT c_discount, c_last, c_credit, w_tax 
FROM (
    SELECT c_w_id, c_discount, c_last, c_credit 
    FROM customer1 
    WHERE c_w_id = 1 AND c_d_id = 2 AND c_id = 2327 
) sub 
JOIN warehouse1 
ON sub.c_w_id = w_id 
WHERE w_id = 1

Which creates a much more optimal query plan:

+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan                                                                                                                                                  |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| Project(sub.c_discount, sub.c_last, sub.c_credit, warehouse1.w_tax)                                                                                   |
|  └─ InnerJoin(sub.c_w_id = warehouse1.w_id)                                                                                                           |
|      ├─ SubqueryAlias(sub)                                                                                                                            |
|      │   └─ Project(customer1.c_w_id, customer1.c_discount, customer1.c_last, customer1.c_credit)                                                     |
|      │       └─ Projected table access on [c_w_id c_discount c_last c_credit]                                                                         |
|      │           └─ IndexedTableAccess(customer1 on [customer1.c_w_id,customer1.c_d_id,customer1.c_id] with ranges: [{[1, 1], [2, 2], [2327, 2327]}]) |
|      └─ IndexedTableAccess(warehouse1 on [warehouse1.w_id] with ranges: [{[1, 1]}])                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------+

And results in significant performance improvements for the (modified) TPC-C benchmark

andy-wm-arthur avatar Jul 11 '22 22:07 andy-wm-arthur

joinIndexes.getUsableIndex currently returns the first index it finds that prefixes the join condition column. We would need the other filters in the above case at the time we select a "usable index" to know a certain index will be unique. Proving the uniqueness of input dependencies for optimal index selection winding up the join tree is a wider problem that we haven't thought about much.

max-hoffman avatar Jul 26 '22 19:07 max-hoffman

The only thing that is going to make this faster is putting filters into the memo, rewriting pushdown, implementing table statistics, and then using filter selectivity + statistics in costing.

max-hoffman avatar Dec 01 '22 17:12 max-hoffman

sbt> explain SELECT c_discount, c_last, c_credit, w_tax  FROM customer1, warehouse1  WHERE w_id = 1 AND c_w_id = w_id AND c_d_id = 2 AND c_id = 2327;
+---------------------------------------------------------------------------------------------+
| plan                                                                                        |
+---------------------------------------------------------------------------------------------+
| Project                                                                                     |
|  ├─ columns: [customer1.c_discount, customer1.c_last, customer1.c_credit, warehouse1.w_tax] |
|  └─ LookupJoin                                                                              |
|      ├─ (customer1.c_w_id = warehouse1.w_id)                                                |
|      ├─ IndexedTableAccess(warehouse1)                                                      |
|      │   ├─ index: [warehouse1.w_id]                                                        |
|      │   ├─ filters: [{[1, 1]}]                                                             |
|      │   └─ columns: [w_id w_tax]                                                           |
|      └─ Filter                                                                              |
|          ├─ ((customer1.c_d_id = 2) AND (customer1.c_id = 2327))                            |
|          └─ IndexedTableAccess(customer1)                                                   |
|              ├─ index: [customer1.c_w_id,customer1.c_d_id,customer1.c_id]                   |
|              └─ columns: [c_id c_d_id c_w_id c_last c_credit c_discount]                    |
+---------------------------------------------------------------------------------------------+
13 rows in set (0.00 sec)

sbt> SELECT c_discount, c_last, c_credit, w_tax  FROM customer1, warehouse1  WHERE w_id = 1 AND c_w_id = w_id AND c_d_id = 2 AND c_id = 2327;
+------------+--------------+----------+-------+
| c_discount | c_last       | c_credit | w_tax |
+------------+--------------+----------+-------+
| 0.05       | ESEATIONEING | GC       | 0.17  |
+------------+--------------+----------+-------+
1 row in set (0.00 sec)

max-hoffman avatar Jun 14 '23 00:06 max-hoffman

This is beautiful.

On Tue, Jun 13, 2023 at 5:09 PM Maximilian Hoffman @.***> wrote:

sbt> explain SELECT c_discount, c_last, c_credit, w_tax FROM customer1, warehouse1 WHERE w_id = 1 AND c_w_id = w_id AND c_d_id = 2 AND c_id = 2327;+---------------------------------------------------------------------------------------------+ | plan |+---------------------------------------------------------------------------------------------+ | Project | | ├─ columns: [customer1.c_discount, customer1.c_last, customer1.c_credit, warehouse1.w_tax] | | └─ LookupJoin | | ├─ (customer1.c_w_id = warehouse1.w_id) | | ├─ IndexedTableAccess(warehouse1) | | │ ├─ index: [warehouse1.w_id] | | │ ├─ filters: [{[1, 1]}] | | │ └─ columns: [w_id w_tax] | | └─ Filter | | ├─ ((customer1.c_d_id = 2) AND (customer1.c_id = 2327)) | | └─ IndexedTableAccess(customer1) | | ├─ index: [customer1.c_w_id,customer1.c_d_id,customer1.c_id] | | └─ columns: [c_id c_d_id c_w_id c_last c_credit c_discount] |+---------------------------------------------------------------------------------------------+13 rows in set (0.00 sec)

sbt> SELECT c_discount, c_last, c_credit, w_tax FROM customer1, warehouse1 WHERE w_id = 1 AND c_w_id = w_id AND c_d_id = 2 AND c_id = 2327;+------------+--------------+----------+-------+ | c_discount | c_last | c_credit | w_tax |+------------+--------------+----------+-------+ | 0.05 | ESEATIONEING | GC | 0.17 |+------------+--------------+----------+-------+1 row in set (0.00 sec)

— Reply to this email directly, view it on GitHub https://github.com/dolthub/dolt/issues/3797#issuecomment-1590232357, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJAR3DF6KYGXCZZEOMHKP3XLD6KJANCNFSM53I7OG4A . You are receiving this because you are subscribed to this thread.Message ID: @.***>

timsehn avatar Jun 14 '23 00:06 timsehn

I'll close. We are actively working on publishing TPC-* benchmarks.

timsehn avatar Sep 25 '23 22:09 timsehn