mysql-5.6
mysql-5.6 copied to clipboard
Figure out if MultiGet (BKA/MRR) can be used with correlated subqueries
Example:
CREATE TABLE `t1` (
`id` bigint(20) NOT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `t2` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id`)
);
insert into t1 values (1),(2),(3);
insert into t2 values (2),(3),(4);
This query uses MultiGet via Batched Key Access.
mysql> set session optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
mysql> explain SELECT t1.* FROM t1 LEFT JOIN t2 ON t1.id=t2.id WHERE t2.id IS NULL;
+----+-------------+-------+--------+---------------+---------+---------+------------+------+------------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------------+------+------------------------------------------------------------------------------+
| 1 | SIMPLE | t1 | index | NULL | PRIMARY | 8 | NULL | 3 | Using index |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 8 | test.t1.id | 1 | Using where; Not exists; Using index; Using join buffer (Batched Key Access) |
+----+-------------+-------+--------+---------------+---------+---------+------------+------+------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
These subqueries can't use MultiGet.
mysql> explain SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2);
+----+--------------------+-------+-----------------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-----------------+---------------+---------+---------+------+------+--------------------------+
| 1 | PRIMARY | t1 | index | NULL | PRIMARY | 8 | NULL | 3 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | t2 | unique_subquery | PRIMARY | PRIMARY | 8 | func | 1 | Using index; Using where |
+----+--------------------+-------+-----------------+---------------+---------+---------+------+------+--------------------------+
2 rows in set (0.00 sec)
mysql> explain SELECT * FROM t1 WHERE NOT EXISTS (SELECT id FROM t2 WHERE t1.id=t2.id);
+----+--------------------+-------+--------+---------------+---------+---------+------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+--------+---------------+---------+---------+------------+------+--------------------------+
| 1 | PRIMARY | t1 | index | NULL | PRIMARY | 8 | NULL | 3 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | t2 | eq_ref | PRIMARY | PRIMARY | 8 | test.t1.id | 1 | Using where; Using index |
+----+--------------------+-------+--------+---------------+---------+---------+------------+------+--------------------------+
2 rows in set (0.00 sec)
If t1 is large, the subqueries need to call lots of Get, instead of MultiGet, so it's even more expensive.
MySQL 8.0 has "anti-semi-join" optimization: https://mysqlserverteam.com/antijoin-in-mysql-8/ It is available since 8.0.17 (the same version that https://github.com/facebook/mysql-5.6/tree/fb-mysql-8.0.17 is based on).
Anti-semi-join handles WHERE ... NOT IN /NOT EXISTS queries. The EXPLAINs below show that Batched Key Access will be used together with anti-semi-join. (I haven't run any benchmarks, though). So, upgrading to MySQL 8 will solve the issue?
Test data: https://gist.github.com/spetrunia/9e25fc21b75bfdda2c98fbabfa8d9182
mysql> set optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
mysql> explain SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2)\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
partitions: NULL
type: index
possible_keys: NULL
key: PRIMARY
key_len: 8
ref: NULL
rows: 0
filtered: 0.00
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: t2
partitions: NULL
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: j1.t1.id
rows: 1
filtered: 100.00
Extra: Using where; Not exists; Using index; Using join buffer (Batched Key Access)
2 rows in set, 1 warning (0.00 sec)
mysql> explain SELECT * FROM t1 WHERE NOT EXISTS (SELECT id FROM t2 WHERE t1.id=t2.id)\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
partitions: NULL
type: index
possible_keys: NULL
key: PRIMARY
key_len: 8
ref: NULL
rows: 3203
filtered: 100.00
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: t2
partitions: NULL
type: eq_ref
possible_keys: PRIMARY
key: PRIMARY
key_len: 8
ref: j1.t1.id
rows: 1
filtered: 100.00
Extra: Using where; Not exists; Using index; Using join buffer (Batched Key Access)
2 rows in set, 2 warnings (0.00 sec)
I should have provided more examples. Scalar subquery continued to enter Dependent Subquery path and did not use MRR.
CREATE TABLE `t3` (
`id` bigint(20) NOT NULL,
`value` bigint(20) NOT NULL,
PRIMARY KEY (`id`)
);
CREATE TABLE `t4` (
`id` bigint(20) NOT NULL,
`value` bigint(20) NOT NULL,
PRIMARY KEY (`id`)
);
insert into t3 values (1,10), (2,20), (3,30),(4,40);
insert into t4 values (1,100),(2,200);
set session optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
select t3.value, (select t4.value from t4 where t3.id=t4.id) as t4_value from t3 where t3.id between 1 and 2;
vs
select t3.value, t4.value from t3, t4 where t3.id between 1 and 2 and t3.id=t4.id;
mysql> explain select t3.value, (select t4.value from t4 where t3.id=t4.id) as t4_value from t3 where t3.id between 1 and 2;
+----+--------------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------------+
| 1 | PRIMARY | t3 | NULL | range | PRIMARY | PRIMARY | 8 | NULL | 2 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | t4 | NULL | eq_ref | PRIMARY | PRIMARY | 8 | test.t3.id | 1 | 100.00 | NULL |
+----+--------------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------------+
mysql> explain select t3.value, t4.value from t3, t4 where t3.id between 1 and 2 and t3.id=t4.id;
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+----------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+----------------------------------------+
| 1 | SIMPLE | t4 | NULL | range | PRIMARY | PRIMARY | 8 | NULL | 1 | 100.00 | Using where |
| 1 | SIMPLE | t3 | NULL | eq_ref | PRIMARY | PRIMARY | 8 | test.t4.id | 1 | 100.00 | Using join buffer (Batched Key Access) |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+----------------------------------------+
Yes, if the subquery is in the select list, then batching is not possible.
Possible ways to solve this:
- Batching for subqueries in the select list
- Merge the subquery into parent select.
Option 1: Implement batching for select list
One needs to:
- Accumulate output records (in the above example, values of
{t3.value, t3.id}
) in some kind of buffer. - Run the subquery in some kind of "multi-run" mode where multiple "parameter" values of
{t4.id}
are supplied and multiple results are produced. - When we are done with the subquery, produce final output rows to the user.
Difficulties:
We'll need to collect a list of subquery's correlation references.
In the general case, it is difficult to modify the subquery so that it is executed for multiple values of its parameter(s). This might be doable for some special cases.
Speaking of special cases, I'm wondering why is the subquery using "DEPENDENT SUBQUERY" and not "index_subquery"? (index_subquery is already a very special case so it would be much easier to check for).
Option 2: Convert to outer join
BKA works for join query, so why don't we convert the subquery into join (like the optimizer already does for semi-joins), and let the optimizer handle the rest?
Applicability criteria
The subquery must be a mergeable subquery (same criteria as for VIEW to have algorithm=merge or for conversion into semi-join).
Considerations
-
The query with subquery should produce NULL when the subquery didn't find the match. This is what LEFT JOIN does (More on this in section #3 below).
-
If the subquery produces multiple matches, the query should produce an error.
Possible ways to solve this: A. Do the conversion only when we know that the subquery will produce at most one match. (there is already code that does such check - a semi-join is converted to inner join when it produces at most one match)
B. Convert to a special kind of outer join which produces an error when there is more than one match.
- Expressions in select list.
So,
select
t3.value,
(select t4.value
from t4 where t3.id=t4.id) as t4_value
from
t3
where
t3.id between 1 and 2;
is converted into
select
t3.value,
t4.value as t4_value
from
t3 left join t4 on t3.id=t4.id -- subquery's WHERE
where t3.id between 1 and 2;
This is valid as long as the subquery's select list expression computes to NULL when subquery tables have null-complemented rows.
If this is not the case.. one can use something like:
IF (not_null_column IS NULL, NULL, original_subquery_select_list)
provided that subquery tables have a not_null_column
, a column that will have NULL value only for NULL-complemented row. Alternatively, one could just disable the transformation for such complex cases.