datafusion
datafusion copied to clipboard
Invalid timezone information is ignored after `Z` in timestamp literals
Describe the bug
Some timestamps that should error are actally parsed, leading to confusing cases when users make a mistake
For example this timestamp is invalid '2023-12-05T21:58:10.45ZZTOP'
but it is parsed as though it were '2023-12-05T21:58:10.45Z'
(the trailing content is ignored)
To Reproduce
using datafusion-cli
,
❯ select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
+-------------------------------------+
| Utf8("2023-12-05T21:58:10.45ZZTOP") |
+-------------------------------------+
| 2023-12-05T21:58:10.450 |
+-------------------------------------+
1 row in set. Query took 0.001 seconds.
Expected behavior
I expect the query to error, similarly to what happens if the Z
is not present
❯ select TIMESTAMP '2023-12-05T21:58:10.45 hello';
Optimizer rule 'simplify_expressions' failed
caused by
Arrow error: Parser error: Invalid timezone "hello": 'hello' is not a valid timezone
This is consistent with postgres, which errors for such cases:
postgres=# select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
ERROR: invalid input syntax for type timestamp: "2023-12-05T21:58:10.45ZZTOP"
LINE 1: select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
^
postgres=# select timestamp '2023-12-05T21:58:10.45 hello';
ERROR: invalid input syntax for type timestamp: "2023-12-05T21:58:10.45 hello"
LINE 1: select timestamp '2023-12-05T21:58:10.45 hello';
^
postgres=# select timestamp '2023-12-05T21:58:10.45';
timestamp
------------------------
2023-12-05 21:58:10.45
(1 row)
Additional context
Kudos to @reidkaufmann for finding this downstream
This appears to be a bug in arrow: https://github.com/apache/arrow-rs/issues/5182
Once it is fixed in arrow, then we can write a test for it in DataFusion and close the issue
Can I give this a shot?
Please do! I think it is beneficial to reject any timestamp literal that has unparsed trailing characters after the timezone ('Z' in the example: Zulu/UTC). Being strict has the benefit of surfacing problems earlier rather than later.
I've made the change on arrow-rs
in here: https://github.com/apache/arrow-rs/pull/5189, should be released in a few weeks and then we can update the version and write the tests here.
Thank you @razeghi71
Is this issue still opening? I am interested in adding more test cases to test if it works as expected.
Thanks @ZhengLin-Li
I double checked and the issue appears to be fixed in the latest DataFusion
Adding a test case would be most appreciated 🙏
DataFusion CLI v35.0.0
❯ select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
Arrow error: Parser error: Invalid timezone "ZZTOP": 'ZZTOP' is not a valid timezone
Hi @alamb Thanks for your follow up.
I tried to add a test case in datafusion/sql/src/parser.rs but it turned out that the parser did not yield an error.
Expected parse error for 'select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP'', but was successful: [Statement(Query(Query { with: None, body: Select(Select { distinct: None, top: None, projection: [UnnamedExpr(TypedString { data_type: Timestamp(None, None), value: "2023-12-05T21:58:10.45ZZTOP" })], into: None, from: [], lateral_views: [], selection: None, group_by: Expressions([]), cluster_by: [], distribute_by: [], sort_by: [], having: None, named_window: [], qualify: None }), order_by: [], limit: None, limit_by: [], offset: None, fetch: None, locks: [], for_clause: None }))]
See: https://github.com/ZhengLin-Li/arrow-datafusion/actions/runs/7936448427/job/21671747144?pr=1
It seems that we only check this in cli but not in parser?
Hi @ZhengLin-Li ,
I think you're confusing arrow's parser with DF's parser.
The job of a parser in a programming language is to validate the syntax and build an AST. TIMESTAMP '2023-12-05T21:58:10.45ZZTOP' is syntactically correct, as TIMESTAMP 'x' and TIMESTAMP '' (empty string).
This is a semantics issue which, in a typical programming language, is checked by the semantic analysis step. In the case of SQL, and as stated in the issue's expected behavior, this should be handled by the optimizer; or worst case scenario, by the query executor (think of interpreted languages that catch semantic errors in runtime).