mysql-5.6 icon indicating copy to clipboard operation
mysql-5.6 copied to clipboard

Figure out if MultiGet (BKA/MRR) can be used with correlated subqueries

Open yoshinorim opened this issue 3 years ago • 3 comments

Example:

CREATE TABLE `t1` (
  `id` bigint(20) NOT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE `t2` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
);

insert into t1 values (1),(2),(3);
insert into t2 values (2),(3),(4);

This query uses MultiGet via Batched Key Access.

mysql> set session optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
mysql> explain SELECT t1.* FROM t1 LEFT JOIN t2 ON t1.id=t2.id WHERE t2.id IS NULL;
+----+-------------+-------+--------+---------------+---------+---------+------------+------+------------------------------------------------------------------------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref        | rows | Extra                                                                        |
+----+-------------+-------+--------+---------------+---------+---------+------------+------+------------------------------------------------------------------------------+
|  1 | SIMPLE      | t1    | index  | NULL          | PRIMARY | 8       | NULL       |    3 | Using index                                                                  |
|  1 | SIMPLE      | t2    | eq_ref | PRIMARY       | PRIMARY | 8       | test.t1.id |    1 | Using where; Not exists; Using index; Using join buffer (Batched Key Access) |
+----+-------------+-------+--------+---------------+---------+---------+------------+------+------------------------------------------------------------------------------+
2 rows in set (0.00 sec)

These subqueries can't use MultiGet.

mysql> explain SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2);
+----+--------------------+-------+-----------------+---------------+---------+---------+------+------+--------------------------+
| id | select_type        | table | type            | possible_keys | key     | key_len | ref  | rows | Extra                    |
+----+--------------------+-------+-----------------+---------------+---------+---------+------+------+--------------------------+
|  1 | PRIMARY            | t1    | index           | NULL          | PRIMARY | 8       | NULL |    3 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | t2    | unique_subquery | PRIMARY       | PRIMARY | 8       | func |    1 | Using index; Using where |
+----+--------------------+-------+-----------------+---------------+---------+---------+------+------+--------------------------+
2 rows in set (0.00 sec)

mysql> explain SELECT * FROM t1 WHERE NOT EXISTS (SELECT id FROM t2 WHERE t1.id=t2.id);
+----+--------------------+-------+--------+---------------+---------+---------+------------+------+--------------------------+
| id | select_type        | table | type   | possible_keys | key     | key_len | ref        | rows | Extra                    |
+----+--------------------+-------+--------+---------------+---------+---------+------------+------+--------------------------+
|  1 | PRIMARY            | t1    | index  | NULL          | PRIMARY | 8       | NULL       |    3 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | t2    | eq_ref | PRIMARY       | PRIMARY | 8       | test.t1.id |    1 | Using where; Using index |
+----+--------------------+-------+--------+---------------+---------+---------+------------+------+--------------------------+
2 rows in set (0.00 sec)

If t1 is large, the subqueries need to call lots of Get, instead of MultiGet, so it's even more expensive.

yoshinorim avatar Feb 04 '21 00:02 yoshinorim

MySQL 8.0 has "anti-semi-join" optimization: https://mysqlserverteam.com/antijoin-in-mysql-8/ It is available since 8.0.17 (the same version that https://github.com/facebook/mysql-5.6/tree/fb-mysql-8.0.17 is based on).

Anti-semi-join handles WHERE ... NOT IN /NOT EXISTS queries. The EXPLAINs below show that Batched Key Access will be used together with anti-semi-join. (I haven't run any benchmarks, though). So, upgrading to MySQL 8 will solve the issue?

Test data: https://gist.github.com/spetrunia/9e25fc21b75bfdda2c98fbabfa8d9182

mysql> set optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on'; 
mysql>  explain SELECT * FROM t1 WHERE id NOT IN (SELECT id FROM t2)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 0
     filtered: 0.00
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: j1.t1.id
         rows: 1
     filtered: 100.00
        Extra: Using where; Not exists; Using index; Using join buffer (Batched Key Access)
2 rows in set, 1 warning (0.00 sec)
mysql> explain SELECT * FROM t1 WHERE NOT EXISTS (SELECT id FROM t2 WHERE t1.id=t2.id)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 3203
     filtered: 100.00
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: j1.t1.id
         rows: 1
     filtered: 100.00
        Extra: Using where; Not exists; Using index; Using join buffer (Batched Key Access)
2 rows in set, 2 warnings (0.00 sec)

spetrunia avatar Feb 04 '21 20:02 spetrunia

I should have provided more examples. Scalar subquery continued to enter Dependent Subquery path and did not use MRR.

CREATE TABLE `t3` (
  `id` bigint(20) NOT NULL,
  `value` bigint(20) NOT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE `t4` (
  `id` bigint(20) NOT NULL,
  `value` bigint(20) NOT NULL,
  PRIMARY KEY (`id`)
);

insert into t3 values (1,10), (2,20), (3,30),(4,40);
insert into t4 values (1,100),(2,200);

set session optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';

select t3.value, (select t4.value from t4 where t3.id=t4.id) as t4_value from t3 where t3.id  between 1 and 2;
vs
select t3.value, t4.value from t3, t4 where t3.id between 1 and 2 and t3.id=t4.id;

mysql> explain select t3.value, (select t4.value from t4 where t3.id=t4.id) as t4_value from t3 where t3.id between 1 and 2;
+----+--------------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------------+
| id | select_type        | table | partitions | type   | possible_keys | key     | key_len | ref        | rows | filtered | Extra       |
+----+--------------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------------+
|  1 | PRIMARY            | t3    | NULL       | range  | PRIMARY       | PRIMARY | 8       | NULL       |    2 |   100.00 | Using where |
|  2 | DEPENDENT SUBQUERY | t4    | NULL       | eq_ref | PRIMARY       | PRIMARY | 8       | test.t3.id |    1 |   100.00 | NULL        |
+----+--------------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------------+


mysql> explain select t3.value, t4.value from t3, t4 where t3.id between 1 and 2 and t3.id=t4.id;
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+----------------------------------------+
| id | select_type | table | partitions | type   | possible_keys | key     | key_len | ref        | rows | filtered | Extra                                  |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+----------------------------------------+
|  1 | SIMPLE      | t4    | NULL       | range  | PRIMARY       | PRIMARY | 8       | NULL       |    1 |   100.00 | Using where                            |
|  1 | SIMPLE      | t3    | NULL       | eq_ref | PRIMARY       | PRIMARY | 8       | test.t4.id |    1 |   100.00 | Using join buffer (Batched Key Access) |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+----------------------------------------+


yoshinorim avatar Feb 05 '21 00:02 yoshinorim

Yes, if the subquery is in the select list, then batching is not possible.

Possible ways to solve this:

  • Batching for subqueries in the select list
  • Merge the subquery into parent select.

Option 1: Implement batching for select list

One needs to:

  • Accumulate output records (in the above example, values of {t3.value, t3.id}) in some kind of buffer.
  • Run the subquery in some kind of "multi-run" mode where multiple "parameter" values of {t4.id} are supplied and multiple results are produced.
  • When we are done with the subquery, produce final output rows to the user.

Difficulties:

We'll need to collect a list of subquery's correlation references.

In the general case, it is difficult to modify the subquery so that it is executed for multiple values of its parameter(s). This might be doable for some special cases.

Speaking of special cases, I'm wondering why is the subquery using "DEPENDENT SUBQUERY" and not "index_subquery"? (index_subquery is already a very special case so it would be much easier to check for).

Option 2: Convert to outer join

BKA works for join query, so why don't we convert the subquery into join (like the optimizer already does for semi-joins), and let the optimizer handle the rest?

Applicability criteria

The subquery must be a mergeable subquery (same criteria as for VIEW to have algorithm=merge or for conversion into semi-join).

Considerations

  1. The query with subquery should produce NULL when the subquery didn't find the match. This is what LEFT JOIN does (More on this in section #3 below).

  2. If the subquery produces multiple matches, the query should produce an error.

Possible ways to solve this: A. Do the conversion only when we know that the subquery will produce at most one match. (there is already code that does such check - a semi-join is converted to inner join when it produces at most one match)

B. Convert to a special kind of outer join which produces an error when there is more than one match.

  1. Expressions in select list.

So,

select 
  t3.value, 
  (select t4.value 
   from t4 where t3.id=t4.id) as t4_value 
from 
  t3 
where 
  t3.id between 1 and 2;

is converted into

select 
  t3.value, 
  t4.value as t4_value
from 
  t3 left join t4 on t3.id=t4.id -- subquery's WHERE
where t3.id between 1 and 2;

This is valid as long as the subquery's select list expression computes to NULL when subquery tables have null-complemented rows.

If this is not the case.. one can use something like:

IF (not_null_column IS NULL, NULL, original_subquery_select_list)

provided that subquery tables have a not_null_column, a column that will have NULL value only for NULL-complemented row. Alternatively, one could just disable the transformation for such complex cases.

spetrunia avatar Feb 08 '21 15:02 spetrunia