zed icon indicating copy to clipboard operation
zed copied to clipboard

Auto-casting of join keys

Open philrz opened this issue 1 year ago • 1 comments

A user suggested that Zed automatically cast join keys as necessary to increase the likelihood of a match without needing to explicitly cast values to comparable types.

Details

Repro is with Zed commit 38763f8.

A community zync user asked the following:

Now that we have automatic sorting on joins, it would also make sense to me to do automatic casting of the left key. I can't count the number of time I had to cast() the left key, and how many hours I spent figuring out how to get my joins working beacuse of that :slightly_smiling_face: + I am guessing zed probably knows what the cast should actually be, based on the right key type ?

They offered this specific example:

...when comparing time values with string dates (e.g., “2024-03-01”). In such cases, the string dates must be cast to the appropriate time type for meaningful comparisons.

Here's a repro of that. As the user described, this first attempt fails because the left key is a string while the right key is a time.

$ zq -version
Version: v1.14.0-16-g38763f82

$ cat datestr.zson 
{datestr: "2024-03-01", word: "hello"}

$ cat time.zson 
{timeval: 2024-03-01T00:00:00Z, word: "goodbye"}

$ cat join.zed 
file datestr.zson
| inner join (
  file time.zson
) on datestr=timeval otherword:=word

$ zq -I join.zed 
[no output]

However, we can force it to work if we cast the left key to time type before the join.

$ cat join-with-cast.zed 
file datestr.zson
| datestr:=time(datestr)
| inner join (
  file time.zson
) on datestr=timeval otherword:=word

$ zq -I join-with-cast.zed 
{datestr:2024-03-01T00:00:00Z,word:"hello",otherword:"goodbye"}

While reproducing this, I also noticed a bonus limitation that might be worth its own issue, but for now I'll just log it here: For now it seems the user is stuck doing the cast upstream in the pipeline, as if I try to do the casting right in the test for equality of the keys, that causes a syntax error.

$ cat join-other-cast.zed 
file datestr.zson
| inner join (
  file time.zson
) on time(datestr)=timeval otherword:=word

$ zq -I join-other-cast.zed 
zq: error parsing Zed in join-other-cast.zed at line 4, column 10:
) on time(datestr)=timeval otherword:=word
     === ^ ===

philrz avatar Mar 14 '24 19:03 philrz

I revisited this issue in the SuperSQL era as of super commit 47b1fb7, as we're now using SQL (specifically Postgres) as a baseline for functionality we may strive to match. Walking through some of the combinations of type comparisons in JOIN, there is more flexibility in Postgres when it comes to the specific example of comparing date & time values as the user originally asked about, and even more auto-cast flexibility in other systems.

Details

Here's a summary of which systems successfully JOIN between equal-ish values of some different types.

Type 1 Type 2 Postgres 17.5 MySQL 9.3.0 DuckDB v1.3.1 ClickHouse 25.6.2.5 SuperDB 47b1fb7
INTEGER REAL
INTEGER VARCHAR
TIMESTAMP DATE N/A
TIMESTMAP STRING

Test Outputs

Postgres

$ psql postgres
psql (17.5 (Homebrew))
Type "help" for help.

postgres=# CREATE TABLE foo (
  int_val       INTEGER,
  real_val      REAL,
  str_int_val   VARCHAR(255),
  time_val      TIMESTAMP,
  date_val      DATE,
  str_time_val  VARCHAR(255)
  );

INSERT INTO foo VALUES (3, 3.0, '3', '2025-07-03 00:00:00', '2025-07-03', '2025-07-03 00:00:00');

CREATE TABLE bar (
  int_val       INTEGER,
  real_val      REAL,
  str_int_val   VARCHAR(255),
  time_val      TIMESTAMP,
  date_val      DATE,
  str_time_val  VARCHAR(255)
  );

INSERT INTO bar VALUES (3, 3.0, '3', '2025-07-03 00:00:00', '2025-07-03', '2025-07-03 00:00:00');
CREATE TABLE
INSERT 0 1
CREATE TABLE
INSERT 0 1

postgres=# SELECT * FROM foo JOIN bar ON foo.int_val=bar.real_val;
 int_val | real_val | str_int_val |      time_val       |  date_val  |    str_time_val     | int_val | real_val | str_int_val |      time_val       |  date_val  |    str_time_val     
---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------
       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00
(1 row)

postgres=# SELECT * FROM foo JOIN bar ON foo.int_val=bar.str_int_val;
ERROR:  operator does not exist: integer = character varying
LINE 1: SELECT * FROM foo JOIN bar ON foo.int_val=bar.str_int_val;
                                                 ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

postgres=# SELECT * FROM foo JOIN bar ON foo.time_val=bar.date_val;
 int_val | real_val | str_int_val |      time_val       |  date_val  |    str_time_val     | int_val | real_val | str_int_val |      time_val       |  date_val  |    str_time_val     
---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------
       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00
(1 row)

postgres=# SELECT * FROM foo JOIN bar ON foo.time_val=bar.str_time_val;
ERROR:  operator does not exist: timestamp without time zone = character varying
LINE 1: SELECT * FROM foo JOIN bar ON foo.time_val=bar.str_time_val;
                                                  ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

MySQL

$ mysqlsh
MySQL Shell 9.3.0

Copyright (c) 2016, 2025, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.

Type '\help' or '\?' for help; '\quit' to exit.
 MySQL  SQL > \connect root@localhost
Creating a session to 'root@localhost'
Fetching global names for auto-completion... Press ^C to stop.
Your MySQL connection id is 19
Server version: 9.3.0 Homebrew
No default schema selected; type \use <schema> to set one.
 MySQL  localhost:3306 ssl  SQL > \use foo
Default schema set to `foo`.
Fetching global names, object names from `foo` for auto-completion... Press ^C to stop.
 MySQL  localhost:3306 ssl  foo  SQL > CREATE TABLE foo (
                                    ->   int_val       INTEGER,
                                    ->   real_val      REAL,
                                    ->   str_int_val   VARCHAR(255),
                                    ->   time_val      TIMESTAMP,
                                    ->   date_val      DATE,
                                    ->   str_time_val  VARCHAR(255)
                                    ->   );
Query OK, 0 rows affected (0.0062 sec)
 MySQL  localhost:3306 ssl  foo  SQL > 
 MySQL  localhost:3306 ssl  foo  SQL > INSERT INTO foo VALUES (3, 3.0, '3', '2025-07-03 00:00:00', '2025-07-03', '2025-07-03 00:00:00');
Query OK, 1 row affected (0.0013 sec)
 MySQL  localhost:3306 ssl  foo  SQL > 
 MySQL  localhost:3306 ssl  foo  SQL > CREATE TABLE bar (
                                    ->   int_val       INTEGER,
                                    ->   real_val      REAL,
                                    ->   str_int_val   VARCHAR(255),
                                    ->   time_val      TIMESTAMP,
                                    ->   date_val      DATE,
                                    ->   str_time_val  VARCHAR(255)
                                    ->   );
Query OK, 0 rows affected (0.0054 sec)

 MySQL  localhost:3306 ssl  foo  SQL > INSERT INTO bar VALUES (3, 3.0, '3', '2025-07-03 00:00:00', '2025-07-03', '2025-07-03 00:00:00');
Query OK, 1 row affected (0.0015 sec)
 MySQL  localhost:3306 ssl  foo  SQL > SELECT * FROM foo JOIN bar ON foo.int_val=bar.real_val;
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
| int_val | real_val | str_int_val | time_val            | date_val   | str_time_val        | int_val | real_val | str_int_val | time_val            | date_val   | str_time_val        |
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
|       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
1 row in set (0.0004 sec)

 MySQL  localhost:3306 ssl  foo  SQL > SELECT * FROM foo JOIN bar ON foo.int_val=bar.str_int_val;
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
| int_val | real_val | str_int_val | time_val            | date_val   | str_time_val        | int_val | real_val | str_int_val | time_val            | date_val   | str_time_val        |
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
|       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
1 row in set (0.0005 sec)

 MySQL  localhost:3306 ssl  foo  SQL > SELECT * FROM foo JOIN bar ON foo.time_val=bar.date_val;
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
| int_val | real_val | str_int_val | time_val            | date_val   | str_time_val        | int_val | real_val | str_int_val | time_val            | date_val   | str_time_val        |
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
|       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
1 row in set (0.0004 sec)

 MySQL  localhost:3306 ssl  foo  SQL > SELECT * FROM foo JOIN bar ON foo.time_val=bar.str_time_val;
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
| int_val | real_val | str_int_val | time_val            | date_val   | str_time_val        | int_val | real_val | str_int_val | time_val            | date_val   | str_time_val        |
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
|       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |       3 |        3 | 3           | 2025-07-03 00:00:00 | 2025-07-03 | 2025-07-03 00:00:00 |
+---------+----------+-------------+---------------------+------------+---------------------+---------+----------+-------------+---------------------+------------+---------------------+
1 row in set (0.0004 sec)

DuckDB

$ duckdb foo.db
DuckDB v1.3.1 (Ossivalis) 2063dda3e6
Enter ".help" for usage hints.
D CREATE TABLE foo (
    int_val       INTEGER,
    real_val      REAL,
    str_int_val   VARCHAR(255),
    time_val      TIMESTAMP,
    date_val      DATE,
    str_time_val  VARCHAR(255)
    );
D 
D INSERT INTO foo VALUES (3, 3.0, '3', '2025-07-03 00:00:00', '2025-07-03', '2025-07-03 00:00:00');
D 
D CREATE TABLE bar (
    int_val       INTEGER,
    real_val      REAL,
    str_int_val   VARCHAR(255),
    time_val      TIMESTAMP,
    date_val      DATE,
    str_time_val  VARCHAR(255)
    );
D 
D INSERT INTO bar VALUES (3, 3.0, '3', '2025-07-03 00:00:00', '2025-07-03', '2025-07-03 00:00:00');

D SELECT * FROM foo JOIN bar ON foo.int_val=bar.real_val;
┌─────────┬──────────┬─────────────┬─────────────────────┬────────────┬─────────────────────┬─────────┬──────────┬─────────────┬─────────────────────┬────────────┬─────────────────────┐
│ int_val │ real_val │ str_int_val │      time_val       │  date_val  │    str_time_val     │ int_val │ real_val │ str_int_val │      time_val       │  date_val  │    str_time_val     │
│  int32  │  float   │   varchar   │      timestamp      │    date    │       varchar       │  int32  │  float   │   varchar   │      timestamp      │    date    │       varchar       │
├─────────┼──────────┼─────────────┼─────────────────────┼────────────┼─────────────────────┼─────────┼──────────┼─────────────┼─────────────────────┼────────────┼─────────────────────┤
│    3    │   3.0    │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │    3    │   3.0    │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │
└─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┴─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┘

D SELECT * FROM foo JOIN bar ON foo.int_val=bar.str_int_val;
┌─────────┬──────────┬─────────────┬─────────────────────┬────────────┬─────────────────────┬─────────┬──────────┬─────────────┬─────────────────────┬────────────┬─────────────────────┐
│ int_val │ real_val │ str_int_val │      time_val       │  date_val  │    str_time_val     │ int_val │ real_val │ str_int_val │      time_val       │  date_val  │    str_time_val     │
│  int32  │  float   │   varchar   │      timestamp      │    date    │       varchar       │  int32  │  float   │   varchar   │      timestamp      │    date    │       varchar       │
├─────────┼──────────┼─────────────┼─────────────────────┼────────────┼─────────────────────┼─────────┼──────────┼─────────────┼─────────────────────┼────────────┼─────────────────────┤
│    3    │   3.0    │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │    3    │   3.0    │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │
└─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┴─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┘

D SELECT * FROM foo JOIN bar ON foo.time_val=bar.date_val;
┌─────────┬──────────┬─────────────┬─────────────────────┬────────────┬─────────────────────┬─────────┬──────────┬─────────────┬─────────────────────┬────────────┬─────────────────────┐
│ int_val │ real_val │ str_int_val │      time_val       │  date_val  │    str_time_val     │ int_val │ real_val │ str_int_val │      time_val       │  date_val  │    str_time_val     │
│  int32  │  float   │   varchar   │      timestamp      │    date    │       varchar       │  int32  │  float   │   varchar   │      timestamp      │    date    │       varchar       │
├─────────┼──────────┼─────────────┼─────────────────────┼────────────┼─────────────────────┼─────────┼──────────┼─────────────┼─────────────────────┼────────────┼─────────────────────┤
│    3    │   3.0    │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │    3    │   3.0    │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │
└─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┴─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┘

D SELECT * FROM foo JOIN bar ON foo.time_val=bar.str_time_val;
┌─────────┬──────────┬─────────────┬─────────────────────┬────────────┬─────────────────────┬─────────┬──────────┬─────────────┬─────────────────────┬────────────┬─────────────────────┐
│ int_val │ real_val │ str_int_val │      time_val       │  date_val  │    str_time_val     │ int_val │ real_val │ str_int_val │      time_val       │  date_val  │    str_time_val     │
│  int32  │  float   │   varchar   │      timestamp      │    date    │       varchar       │  int32  │  float   │   varchar   │      timestamp      │    date    │       varchar       │
├─────────┼──────────┼─────────────┼─────────────────────┼────────────┼─────────────────────┼─────────┼──────────┼─────────────┼─────────────────────┼────────────┼─────────────────────┤
│    3    │   3.0    │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │    3    │   3.0    │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │
└─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┴─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┘

ClickHouse

$ clickhouse
ClickHouse local version 25.6.2.5 (official build).

:) CREATE TABLE foo (
  int_val       INTEGER,
  rCREATE TABLE foo (
  int_val       INTEGER,
  real_val      REAL,
  str_int_val   VARCHAR(255),
  time_val      TIMESTAMP,
  date_val      DATE,
  str_time_val  VARCHAR(255)
  )
  ENGINE = MergeTree
  ORDER BY tuple();

INSERT INTO foo VALUES (3, 3.0, '3', '2025-07-03 00:00:00', '2025-07-03', '2025-07-03 00:00:00');

CREATE TABLE bar (
  int_val       INTEGER,
  real_val      REAL,
  str_int_val   VARCHAR(255),
  time_val      TIMESTAMP,
  date_val      DATE,
  str_time_val  VARCHAR(255)
  )
  ENGINE = MergeTree
  ORDER BY tuple();

INSERT INTO bar VALUES (3, 3.0, '3', '2025-07-03 00:00:00', '2025-07-03', '2025-07-03 00:00:00');

CREATE TABLE foo
(
    `int_val` INTEGER,
    `real_val` REAL,
    `str_int_val` VARCHAR(255),
    `time_val` TIMESTAMP,
    `date_val` DATE,
    `str_time_val` VARCHAR(255)
)
ENGINE = MergeTree
ORDER BY tuple()

Query id: e3dc3611-e552-411a-91c5-54ceb192f3fd

Ok.

0 rows in set. Elapsed: 0.007 sec. 

INSERT INTO foo FORMAT Values

Query id: a10c76cd-9c29-4779-ba70-ce8871bff944

Ok.

1 row in set. Elapsed: 0.004 sec. 

CREATE TABLE bar
(
    `int_val` INTEGER,
    `real_val` REAL,
    `str_int_val` VARCHAR(255),
    `time_val` TIMESTAMP,
    `date_val` DATE,
    `str_time_val` VARCHAR(255)
)
ENGINE = MergeTree
ORDER BY tuple()

Query id: e5728578-bf10-40d2-9abe-3af7d673cb6e

Ok.

0 rows in set. Elapsed: 0.002 sec. 


INSERT INTO bar FORMAT Values

Query id: 97d0b576-7004-455d-943b-3bbb8711334b

Ok.

1 row in set. Elapsed: 0.003 sec. 

:) SELECT * FROM foo JOIN bar ON foo.int_val=bar.real_val;

SELECT *
FROM foo
INNER JOIN bar ON foo.int_val = bar.real_val

Query id: 2e32a95d-3fc4-43b1-8488-a106abf7f342

   ┌─int_val─┬─real_val─┬─str_int_val─┬────────────time_val─┬───date_val─┬─str_time_val────────┬─bar.int_val─┬─bar.real_val─┬─bar.str_int_val─┬────────bar.time_val─┬─bar.date_val─┬─bar.str_time_val────┐
1. │       3 │        3 │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │           3 │            3 │ 3               │ 2025-07-03 00:00:00 │   2025-07-03 │ 2025-07-03 00:00:00 │
   └─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┴─────────────┴──────────────┴─────────────────┴─────────────────────┴──────────────┴─────────────────────┘

1 row in set. Elapsed: 0.005 sec. 

:) SELECT * FROM foo JOIN bar ON foo.int_val=bar.str_int_val;

SELECT *
FROM foo
INNER JOIN bar ON foo.int_val = bar.str_int_val

Query id: 3c85905a-7085-454f-af04-2ee1fec58157

Elapsed: 0.122 sec. 

Received exception:
Code: 386. DB::Exception: There is no supertype for types Int32, String because some of them are String/FixedString/Enum and some of them are not: JOIN cannot infer common type in ON section for keys. Left key '__table1.int_val' type Int32. Right key '__table2.str_int_val' type String. (NO_COMMON_TYPE)

:) SELECT * FROM foo JOIN bar ON foo.time_val=bar.date_val;

SELECT *
FROM foo
INNER JOIN bar ON foo.time_val = bar.date_val

Query id: 3af5ed6e-23ad-4aeb-9cf6-397096484f3d
   ┌─int_val─┬─real_val─┬─str_int_val─┬────────────time_val─┬───date_val─┬─str_time_val────────┬─bar.int_val─┬─bar.real_val─┬─bar.str_int_val─┬────────bar.time_val─┬─bar.date_val─┬─bar.str_time_val────┐
1. │       3 │        3 │ 3           │ 2025-07-03 00:00:00 │ 2025-07-03 │ 2025-07-03 00:00:00 │           3 │            3 │ 3               │ 2025-07-03 00:00:00 │   2025-07-03 │ 2025-07-03 00:00:00 │
   └─────────┴──────────┴─────────────┴─────────────────────┴────────────┴─────────────────────┴─────────────┴──────────────┴─────────────────┴─────────────────────┴──────────────┴─────────────────────┘

1 row in set. Elapsed: 0.004 sec. 

:) SELECT * FROM foo JOIN bar ON foo.time_val=bar.str_time_val;

SELECT *
FROM foo
INNER JOIN bar ON foo.time_val = bar.str_time_val

Query id: 174d3395-8d5b-4a0b-8f18-a3afc49ae88a

Elapsed: 0.001 sec. 

Received exception:
Code: 386. DB::Exception: There is no supertype for types DateTime, String because some of them are String/FixedString/Enum and some of them are not: JOIN cannot infer common type in ON section for keys. Left key '__table1.time_val' type DateTime. Right key '__table2.str_time_val' type String. (NO_COMMON_TYPE)

SuperDB

SuperDB currently lacks a DATE type so that permutation is excluded. Also, due to the lack of support for static type checking (#5997) failure of the equality check in JOIN is currently communicated by lack of output rather than an explicit error message about mismatched types.

$ super -version
Version: 47b1fb7d8

$ cat foo.sup
{int_val:3, real_val:3.0, str_int_val:"3", time_val:2025-07-03T00:00:00Z, str_time_val:"2025-07-03T00:00:00Z"}

$ cat bar.sup
{int_val:3, real_val:3.0, str_int_val:"3", time_val:2025-07-03T00:00:00Z, str_time_val:"2025-07-03T00:00:00Z"}

$ super -c "SELECT * FROM foo.sup AS foo JOIN bar.sup AS bar ON foo.int_val=bar.real_val;"
{int_val:3,real_val:3.,str_int_val:"3",time_val:2025-07-03T00:00:00Z,str_time_val:"2025-07-03T00:00:00Z"}

$ super -c "SELECT * FROM foo.sup AS foo JOIN bar.sup AS bar ON foo.int_val=bar.str_int_val;"
[no output]

$ super -c "SELECT * FROM foo.sup AS foo JOIN bar.sup AS bar ON foo.time_val=bar.str_time_val;"
[no output]

philrz avatar Jul 05 '25 21:07 philrz