Apparently wrong position of RawStmt in multi-statements script
Hi,
I am investigating on an issue reported on pglast, where the stmt_location of the RawStmt are not consistent, when feeding a multi-statements SQL to pg_query_parse() function. I could not figure either a reason, or a workaround.
A little detail: pglast uses the location/stmt_location only when it asked to preserve the comments in the statement. In that case, it first collects all the comments using pg_query_scan(), and then it does its best to emit them as soon as it reprints (deparse, in libpg_query parlance) each node, roughly comparing the node position with the one of the collected comments.
Consider the reported sample SQL:
-- First comment
SELECT * FROM first_table;
-- Second comment
SELECT * FROM second_table;
Adding it to examples/scan.c, I get the following:
version: 170000, tokens: 12, size: 114
"-- First comment" = [ 0, 16, SQL_COMMENT, NO_KEYWORD ]
"SELECT" = [ 17, 23, SELECT, RESERVED_KEYWORD ]
"*" = [ 24, 25, ASCII_42, NO_KEYWORD ]
"FROM" = [ 26, 30, FROM, RESERVED_KEYWORD ]
"first_table" = [ 31, 42, IDENT, NO_KEYWORD ]
";" = [ 42, 43, ASCII_59, NO_KEYWORD ]
"-- Second comment" = [ 45, 62, SQL_COMMENT, NO_KEYWORD ]
"SELECT" = [ 63, 69, SELECT, RESERVED_KEYWORD ]
"*" = [ 70, 71, ASCII_42, NO_KEYWORD ]
"FROM" = [ 72, 76, FROM, RESERVED_KEYWORD ]
"second_table" = [ 77, 89, IDENT, NO_KEYWORD ]
";" = [ 89, 90, ASCII_59, NO_KEYWORD ]
Notice the position of the second SELECT: 63.
Using pg_query_parse() I get this (reformatted, for clarity):
{
"version": 170000,
"stmts": [
{
"stmt": {
"SelectStmt": {
"targetList": [
{
"ResTarget": {
"val": {
"ColumnRef": {
"fields": [
{
"A_Star": {}
}
],
"location": 24
}
},
"location": 24
}
}
],
"fromClause": [
{
"RangeVar": {
"relname": "first_table",
"inh": true,
"relpersistence": "p",
"location": 31
}
}
],
"limitOption": "LIMIT_OPTION_DEFAULT",
"op": "SETOP_NONE"
}
},
"stmt_len": 42
},
{
"stmt": {
"SelectStmt": {
"targetList": [
{
"ResTarget": {
"val": {
"ColumnRef": {
"fields": [
{
"A_Star": {}
}
],
"location": 70
}
},
"location": 70
}
}
],
"fromClause": [
{
"RangeVar": {
"relname": "second_table",
"inh": true,
"relpersistence": "p",
"location": 77
}
}
],
"limitOption": "LIMIT_OPTION_DEFAULT",
"op": "SETOP_NONE"
}
},
"stmt_location": 43,
"stmt_len": 46
}
]
}
As you can see, the second RawStmt is reported to start at 43, just after the semicolon ending the first SELECT, that is clearly wrong, while all other sub-nodes location are correct.
Thanks in advance for any clue!
As you can see, the second RawStmt is reported to start at 43, just after the semicolon ending the first SELECT, that is clearly wrong, while all other sub-nodes location are correct.
This is explained by the following change done in Postgres 18, that I think fixes the issue at hand: https://github.com/postgres/postgres/commit/14e5680eee19df8b41ef77330d0b3857f498e4f7
Once we regenerate the parser for Postgres 18 this will resolve itself.
Since this is a backwards incompatible change (a user of the library may rely on the current behavior where the comment before each statement is pulled in when splitting based on stmt_location), whilst we could theoretically try backporting the patch to resolve this, I'd be inclined to not modify this behavior in the 17 branch.
Can you workaround this problem in pglast itself?
Thank you for the clarification.
Can you workaround this problem in pglast itself?
I will try again and see if it's doable... before v18 comes out 😉
This seems effectively solved with the new parser, thank you!