parsing parenthesis after keywords
Parsing a sql with parenthesis right after keyword without any whitespace characters (which is syntactically correct in SQL standard) will produce the belowing unexpected result:
sqlparse.parse('select * from child join(select * from parent) as p on child.parent_id=p.id;')[0].tokens
[<DML 'select' at 0x7F448E3FF980>, <Whitespace ' ' at 0x7F448E3FFA60>, <Wildcard '*' at 0x7F448E3FFAD0>,
<Whitespace ' ' at 0x7F448E3FFB40>, <Keyword 'from' at 0x7F448E3FF9F0>, <Whitespace ' ' at 0x7F448E3FFC20>,
<Identifier 'child ...' at 0x7F448E3FBA50>, <Whitespace ' ' at 0x7F448E4053D0>, <Keyword 'on' at 0x7F448E405440>,
<Whitespace ' ' at 0x7F448E4054B0>, <Comparison 'child....' at 0x7F448E3FBC50>, <Punctuation ';' at 0x7F448E405830>]
Compared with the correct result adding a whitespace after JOIN:
sqlparse.parse('select * from child join (select * from parent) as p on child.parent_id=p.id;')[0].tokens
[<DML 'select' at 0x7F448E4123D0>, <Whitespace ' ' at 0x7F448E412440>, <Wildcard '*' at 0x7F448E4124B0>,
<Whitespace ' ' at 0x7F448E412520>, <Keyword 'from' at 0x7F448E40FD70>, <Whitespace ' ' at 0x7F448E412600>,
<Identifier 'child' at 0x7F448E40BED0>, <Whitespace ' ' at 0x7F448E4126E0>, <Keyword 'join' at 0x7F448E412590>,
<Whitespace ' ' at 0x7F448E4127C0>, <Identifier '(selec...' at 0x7F448E40BD50>, <Whitespace ' ' at 0x7F448E412DE0>,
<Keyword 'on' at 0x7F448E40F2F0>, <Whitespace ' ' at 0x7F448E40F360>, <Comparison 'child....' at 0x7F448E40BA50>,
<Punctuation ';' at 0x7F448E40EDE0>]
I've noticed there were multiple issues already solving this issue for individual keywords (e.g. https://github.com/andialbrecht/sqlparse/issues/507, https://github.com/andialbrecht/sqlparse/issues/446), closed by extending this list of keywords: https://github.com/andialbrecht/sqlparse/blob/3bc7d93254cbef71bdef91905f1814201a1b1f02/sqlparse/keywords.py#L52
I found this issue while searching for a reason why Querybook, which uses sqlparse, is detecting UNNEST as a table name in e.g. Presto query ...CROSS JOIN UNNEST(t.id).... The reason is UNNEST is detected as tokens.Name instead of tokens.Keyword, because it's followed by parenthesis = exactly the same issue as it the mentioned closed issue.
So I think there are two options:
- Continue with extending the list, now with
JOINandUNNEST🙏 keywords - Approach this somehow generally
- I was thinking about replacing
(r'[A-ZÀ-Ü]\w*(?=\()', tokens.Name)with(r'[A-ZÀ-Ü]\w*(?=\()', is_keyword)here https://github.com/andialbrecht/sqlparse/blob/3bc7d93254cbef71bdef91905f1814201a1b1f02/sqlparse/keywords.py#L63 but then some tests were broken, becausesqlparse.sql.Functionwasn't detected correctly for keywords, that were handled as names before, and therefore callingget_parameters()method ofsql.Functionis failing/missing. Then it got more complicated with the grouping functions and I didn't find any good way how to fix it without breaking any unit test, yet. But seemed like correct idea in the beginning...
- I was thinking about replacing
What do you think @andialbrecht ?
Adding to the list WHERE x = ANY(...), where ANY is a keyword, not a function.