datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Invalid timezone information is ignored after `Z` in timestamp literals

Open alamb opened this issue 1 year ago • 7 comments

Describe the bug

Some timestamps that should error are actally parsed, leading to confusing cases when users make a mistake

For example this timestamp is invalid '2023-12-05T21:58:10.45ZZTOP' but it is parsed as though it were '2023-12-05T21:58:10.45Z' (the trailing content is ignored)

To Reproduce

using datafusion-cli,

❯ select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
+-------------------------------------+
| Utf8("2023-12-05T21:58:10.45ZZTOP") |
+-------------------------------------+
| 2023-12-05T21:58:10.450             |
+-------------------------------------+
1 row in set. Query took 0.001 seconds.

Expected behavior

I expect the query to error, similarly to what happens if the Z is not present

❯ select TIMESTAMP '2023-12-05T21:58:10.45 hello';
Optimizer rule 'simplify_expressions' failed
caused by
Arrow error: Parser error: Invalid timezone "hello": 'hello' is not a valid timezone

This is consistent with postgres, which errors for such cases:

postgres=# select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
ERROR:  invalid input syntax for type timestamp: "2023-12-05T21:58:10.45ZZTOP"
LINE 1: select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
                         ^
postgres=# select timestamp '2023-12-05T21:58:10.45 hello';
ERROR:  invalid input syntax for type timestamp: "2023-12-05T21:58:10.45 hello"
LINE 1: select timestamp '2023-12-05T21:58:10.45 hello';
                         ^
postgres=# select timestamp '2023-12-05T21:58:10.45';
       timestamp
------------------------
 2023-12-05 21:58:10.45
(1 row)

Additional context

Kudos to @reidkaufmann for finding this downstream

This appears to be a bug in arrow: https://github.com/apache/arrow-rs/issues/5182

Once it is fixed in arrow, then we can write a test for it in DataFusion and close the issue

alamb avatar Dec 07 '23 11:12 alamb

Can I give this a shot?

razeghi71 avatar Dec 07 '23 15:12 razeghi71

Please do! I think it is beneficial to reject any timestamp literal that has unparsed trailing characters after the timezone ('Z' in the example: Zulu/UTC). Being strict has the benefit of surfacing problems earlier rather than later.

reidkaufmann avatar Dec 07 '23 18:12 reidkaufmann

I've made the change on arrow-rs in here: https://github.com/apache/arrow-rs/pull/5189, should be released in a few weeks and then we can update the version and write the tests here.

razeghi71 avatar Dec 08 '23 14:12 razeghi71

Thank you @razeghi71

alamb avatar Dec 08 '23 18:12 alamb

Is this issue still opening? I am interested in adding more test cases to test if it works as expected.

zhenglin-charlie-li avatar Feb 16 '24 04:02 zhenglin-charlie-li

Thanks @ZhengLin-Li

I double checked and the issue appears to be fixed in the latest DataFusion

Adding a test case would be most appreciated 🙏

DataFusion CLI v35.0.0
❯ select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP';
Arrow error: Parser error: Invalid timezone "ZZTOP": 'ZZTOP' is not a valid timezone

alamb avatar Feb 16 '24 13:02 alamb

Hi @alamb Thanks for your follow up.

I tried to add a test case in datafusion/sql/src/parser.rs but it turned out that the parser did not yield an error.

Expected parse error for 'select TIMESTAMP '2023-12-05T21:58:10.45ZZTOP'', but was successful: [Statement(Query(Query { with: None, body: Select(Select { distinct: None, top: None, projection: [UnnamedExpr(TypedString { data_type: Timestamp(None, None), value: "2023-12-05T21:58:10.45ZZTOP" })], into: None, from: [], lateral_views: [], selection: None, group_by: Expressions([]), cluster_by: [], distribute_by: [], sort_by: [], having: None, named_window: [], qualify: None }), order_by: [], limit: None, limit_by: [], offset: None, fetch: None, locks: [], for_clause: None }))]

See: https://github.com/ZhengLin-Li/arrow-datafusion/actions/runs/7936448427/job/21671747144?pr=1

It seems that we only check this in cli but not in parser?

zhenglin-charlie-li avatar Feb 16 '24 23:02 zhenglin-charlie-li

Hi @ZhengLin-Li ,

I think you're confusing arrow's parser with DF's parser.

The job of a parser in a programming language is to validate the syntax and build an AST. TIMESTAMP '2023-12-05T21:58:10.45ZZTOP' is syntactically correct, as TIMESTAMP 'x' and TIMESTAMP '' (empty string).

This is a semantics issue which, in a typical programming language, is checked by the semantic analysis step. In the case of SQL, and as stated in the issue's expected behavior, this should be handled by the optimizer; or worst case scenario, by the query executor (think of interpreted languages that catch semantic errors in runtime).

MohamedAbdeen21 avatar Mar 02 '24 18:03 MohamedAbdeen21