datafusion
datafusion copied to clipboard
Incorrect LEFT JOIN evaluation result on OR conditions
Describe the bug
When the matching condition is on columns and combined with OR
, the result retuned by LEFT JOIN
is similar to INNER JOIN
, which is incorrect.
LEFT JOIN
should not reduce the number of rows returned, i.e. it should return all the records from the left table, and the matched records from the right table. When there is no match, the result is a NULL
, i.e. display as empty.
INNER JOIN
reduces the number of rows, i.e. it only returns the records that have matching values in both tables.
To Reproduce
- Use DataFusion CLI:
# in dir datafusion/datafusion-cli
$ cargo build
$ ./target/debug/datafusion-cli
- Write sample data
CREATE OR REPLACE TABLE employees(emp_id INT, name VARCHAR) AS VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
CREATE OR REPLACE TABLE department(emp_id INT, department VARCHAR) AS VALUES (1, 'HR'), (3, 'Engineering'), (4, 'Sales');
- This is how
employees
table anddepartment
table look like
> SELECT * FROM employees;
+--------+-------+
| emp_id | name |
+--------+-------+
| 1 | Alice |
| 2 | Bob |
| 3 | Carol |
+--------+-------+
3 row(s) fetched.
Elapsed 0.009 seconds.
> SELECT * FROM department;
+--------+-------------+
| emp_id | department |
+--------+-------------+
| 1 | HR |
| 3 | Engineering |
| 4 | Sales |
+--------+-------------+
3 row(s) fetched.
Elapsed 0.006 seconds.
- Query
> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob');
+--------+-------+-------------+
| emp_id | name | department |
+--------+-------+-------------+
| 1 | Alice | HR |
| 1 | Alice | Engineering |
| 1 | Alice | Sales |
| 2 | Bob | HR |
| 2 | Bob | Engineering |
| 2 | Bob | Sales |
+--------+-------+-------------+ <-- should have one more row for Carol
6 row(s) fetched.
Elapsed 0.013 seconds.
> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'NotExist1' OR e.name = 'NotExist2');
+--------+------+------------+
| emp_id | name | department |
+--------+------+------------+
+--------+------+------------+ <-- should have three rows, 1 row for Alice, 1 row for Bob, and 1 row for Carol
0 row(s) fetched.
Elapsed 0.014 seconds.
> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'NotExist');
+--------+-------+-------------+
| emp_id | name | department |
+--------+-------+-------------+
| 1 | Alice | HR |
| 1 | Alice | Engineering |
| 1 | Alice | Sales |
+--------+-------+-------------+ <-- should have two more rows, 1 row for Bob and 1 row for Carol
3 row(s) fetched.
Elapsed 0.014 seconds.
Expected behavior
Postgres shows the expected results.
- In psql cli, drop tables and create new tables with data
DROP TABLE IF EXISTS employees;
DROP TABLE IF EXISTS department;
# create tables
CREATE TABLE employees(emp_id INT, name VARCHAR);
CREATE TABLE department(emp_id INT, dept_name VARCHAR);
# write sample data
INSERT INTO employees (emp_id, name) VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO department (emp_id, dept_name) VALUES (1, 'HR'), (3, 'Engineering'), (4, 'Sales');
- This is how two tables look like:
chunchun=# SELECT * FROM employees;
emp_id | name
--------+-------
1 | Alice
2 | Bob
3 | Carol
(3 rows)
chunchun=# SELECT * FROM department;
emp_id | dept_name
--------+-------------
1 | HR
3 | Engineering
4 | Sales
(3 rows)
- Query
SELECT e.emp_id, e.name, d.department
FROM employees AS e
LEFT JOIN department AS d
ON (e.name = 'Alice' OR e.name = 'Bob');
emp_id | name | department
--------+-------+-------------
1 | Alice | HR
1 | Alice | Engineering
1 | Alice | Sales
2 | Bob | HR
2 | Bob | Engineering
2 | Bob | Sales
3 | Carol |
(7 rows)
SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'NotExist1' OR e.name = 'NotExist2');
emp_id | name | department
--------+-------+------------
1 | Alice |
2 | Bob |
3 | Carol |
(3 rows)
SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'NotExist');
emp_id | name | department
--------+-------+-------------
1 | Alice | HR
1 | Alice | Engineering
1 | Alice | Sales
2 | Bob |
3 | Carol |
(5 rows)
Additional context
No response