datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Incorrect LEFT JOIN evaluation result on OR conditions

Open appletreeisyellow opened this issue 8 months ago • 7 comments

Describe the bug

When the matching condition is on columns and combined with OR, the result retuned by LEFT JOIN is similar to INNER JOIN, which is incorrect.

LEFT JOIN should not reduce the number of rows returned, i.e. it should return all the records from the left table, and the matched records from the right table. When there is no match, the result is a NULL, i.e. display as empty.

INNER JOIN reduces the number of rows, i.e. it only returns the records that have matching values in both tables.

To Reproduce

  1. Use DataFusion CLI:
# in dir datafusion/datafusion-cli
$ cargo build
$ ./target/debug/datafusion-cli
  1. Write sample data
CREATE OR REPLACE TABLE employees(emp_id INT, name VARCHAR) AS VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
CREATE OR REPLACE TABLE department(emp_id INT, department VARCHAR) AS VALUES (1, 'HR'), (3, 'Engineering'), (4, 'Sales');
  1. This is how employees table and department table look like
> SELECT * FROM employees;
+--------+-------+
| emp_id | name  |
+--------+-------+
| 1      | Alice |
| 2      | Bob   |
| 3      | Carol |
+--------+-------+
3 row(s) fetched.
Elapsed 0.009 seconds.

> SELECT * FROM department;
+--------+-------------+
| emp_id | department  |
+--------+-------------+
| 1      | HR          |
| 3      | Engineering |
| 4      | Sales       |
+--------+-------------+
3 row(s) fetched.
Elapsed 0.006 seconds.
  1. Query
> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'Bob');
+--------+-------+-------------+
| emp_id | name  | department  |
+--------+-------+-------------+
| 1      | Alice | HR          |
| 1      | Alice | Engineering |
| 1      | Alice | Sales       |
| 2      | Bob   | HR          |
| 2      | Bob   | Engineering |
| 2      | Bob   | Sales       |
+--------+-------+-------------+ <-- should have one more row for Carol
6 row(s) fetched.
Elapsed 0.013 seconds.

> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'NotExist1' OR e.name = 'NotExist2');
+--------+------+------------+
| emp_id | name | department |
+--------+------+------------+
+--------+------+------------+ <-- should have three rows, 1 row for Alice, 1 row for Bob, and 1 row for Carol
0 row(s) fetched.
Elapsed 0.014 seconds.

> SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'NotExist');
+--------+-------+-------------+
| emp_id | name  | department  |
+--------+-------+-------------+
| 1      | Alice | HR          |
| 1      | Alice | Engineering |
| 1      | Alice | Sales       |
+--------+-------+-------------+ <-- should have two more rows, 1 row for Bob and 1 row for Carol
3 row(s) fetched.
Elapsed 0.014 seconds.

Expected behavior

Postgres shows the expected results.

  1. In psql cli, drop tables and create new tables with data
DROP TABLE IF EXISTS employees;
DROP TABLE IF EXISTS department;

# create tables
CREATE TABLE employees(emp_id INT, name VARCHAR);
CREATE TABLE department(emp_id INT, dept_name VARCHAR);

# write sample data
INSERT INTO employees (emp_id, name) VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
INSERT INTO department (emp_id, dept_name) VALUES (1, 'HR'), (3, 'Engineering'), (4, 'Sales');
  1. This is how two tables look like:
chunchun=# SELECT * FROM employees;
 emp_id | name
--------+-------
      1 | Alice
      2 | Bob
      3 | Carol
(3 rows)

chunchun=# SELECT * FROM department;
 emp_id |  dept_name
--------+-------------
      1 | HR
      3 | Engineering
      4 | Sales
(3 rows)
  1. Query
SELECT e.emp_id, e.name, d.department
FROM employees AS e
LEFT JOIN department AS d
ON (e.name = 'Alice' OR e.name = 'Bob');
 emp_id | name  | department
--------+-------+-------------
      1 | Alice | HR
      1 | Alice | Engineering
      1 | Alice | Sales
      2 | Bob   | HR
      2 | Bob   | Engineering
      2 | Bob   | Sales
      3 | Carol |
(7 rows)

SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'NotExist1' OR e.name = 'NotExist2');
 emp_id | name  | department
--------+-------+------------
      1 | Alice |
      2 | Bob   |
      3 | Carol |
(3 rows)

SELECT e.emp_id, e.name, d.department
FROM employees e
LEFT JOIN department d
ON (e.name = 'Alice' OR e.name = 'NotExist');
 emp_id | name  | department
--------+-------+-------------
      1 | Alice | HR
      1 | Alice | Engineering
      1 | Alice | Sales
      2 | Bob   |
      3 | Carol |
(5 rows)

Additional context

No response

appletreeisyellow avatar Jun 11 '24 23:06 appletreeisyellow