sqlglot icon indicating copy to clipboard operation
sqlglot copied to clipboard

Behavior of HEX function is different between dialects

Open viplazylmht opened this issue 9 months ago • 3 comments

Problem:

HEX function encodes a bytes or string into a hexadecimal encoding, but the result of the function is vary between dialects, due to the case of the result.

For example, in bigquery (lowercase):

select to_hex(MD5('Hello'))
|  Row  |              f0_            |  
| ----- | --------------------------- |
|   1   | 8b1a9953c4611296a827abf8c47 |

in presto/trino (uppercase):

presto> select to_hex(to_utf8('some string'));
         _col0          
------------------------
 736F6D6520737472696E67 
(1 row)

in clickhouse return uppercase. in spark (uppercase), but hex(md5_digest(x)) produce a lowercase result:

>>> spark.sql("select hex('Hello')").collect()
[Row(hex(Hello)='48656C6C6F')]
>>> spark.sql("select md5('Hello')").collect()
[Row(md5(Hello)='8b1a9953c4611296a827abf8c47804d7')]

In snowflake, the default behavior is uppercase, but we can pass an additional argument to control the case: https://docs.snowflake.com/en/sql-reference/functions/hex_encode#argumentshttps://docs.snowflake.com/en/sql-reference/functions/hex_encode#arguments

And so on...

=> So, as a transpiler tool, should we ensure the behavior of this function while transpiling?

The solution if needed:

We can add a new exp.UpperHex (or exp.LowerHex, depends on the case of default exp.Hex). Also, we add a flag to Dialect class, which mark the behavior of HEX function of the target dialect is uppercase or lowercase, then can safety and can simplify the SQL while parsing or generating the sql.

For example, if read dialect is bigquery (lowercase), write dialect is trino (uppercase), the following step will happen when transpiling the sql select lower(to_hex(col_x)):

  1. Parse sql to bigquery: becasue bq are already lowercased, the tree will look like: exp.SELECT (exp.Hex( col_x ))
  2. Generate trino sql: because trino is uppercased, generator will be generating SELECT LOWER(TO_HEX(col_x))

If we transpile select upper(to_hex(col_x)) from bigquery to trino, the code will work like this:

  1. parse sql to exp.Select( exp.UpperHex( col_x ))
  2. Generate trinno sql: SELECT TO_HEX(col_x)

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

viplazylmht avatar May 11 '24 07:05 viplazylmht

Thanks for the reports / PRs @viplazylmht, these make sense. Perhaps we could only add a single UpperHex node and treat the current one as LowerHex? Would you like to work on this?

georgesittas avatar May 11 '24 13:05 georgesittas

@georgesittas Thanks, let me finish this

viplazylmht avatar May 11 '24 18:05 viplazylmht

I found hive dialect is uppercase too. Spark inherits spark2, which inherits hive, all are the same. So, maybe most of dialects are in uppercase.

0: jdbc:hive2://localhost:10000/> select hex('Hello');
INFO  : Compiling command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3): select hex('Hello')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3); Time taken: 0.151 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3): select hex('Hello')
INFO  : Completed executing command(queryId=hive_20240512035403_4f5122ef-45ac-4a13-a743-99d0b69736a3); Time taken: 0.001 seconds
+-------------+
|     _c0     |
+-------------+
| 48656C6C6F  |
+-------------+
1 row selected (0.176 seconds)
0: jdbc:hive2://localhost:10000/>

viplazylmht avatar May 12 '24 03:05 viplazylmht