Expose libpq client lexer
Currently, the provided parser aborts processing on any error. This is problematic for anything that would like to resume processing, for example LSP-type implementations that want to parse larger files and not be limited to stopping on the first error.
The libpq lexer would do this and guarantee that statements are split 1:1 with how they are seen in psql. It would additionally allow psql backslash commands (\if, \endif, etc.) to be at least removed, or possibly even handled the same way the server would handle them, which would lower the downstream parsing burden and improve compatibility with the server parsing outcomes.
Hacking around a bit, I was able to work out a simple program that exposed psqlscan.l and related functions and did some basic lexing that could sit in front of an LSP implementation. To do this, though, I had to rip out various things and compile by hand; the postgres build system is complex and I was not able to get it to work with standard make. I'm thinking that this is the right project to solve that particular problem: basically, expose the statement lexer and return a list of high-level primitives (statement, backslash command, etc.) along with their character start/end offsets in the stream. This could then be fed back into the parsing API or used in some other deeper parsing implementation.
What do you think?
Thanks for reaching out! To capture my understanding:
- The main use case here is statement splitting
- You're looking to recover better from errors, so the parser is not a fit for this
It was briefly mentioned in the Postgres LSP issue, but I'm not sure if you saw that: libpg_query already exposes the regular Postgres query lexer in pg_query_scan, which is also used to offer pg_query_split_with_scanner.
Can you confirm that besides the missing psql metacommand support, this existing facility also has other challenges that make it not a fit for the use case?
Definitely open to adding the psql lexer as well (it should be possible), but maintaining any additional scan/parse code carries a maintenance burden, and so I'd like to fully understand the use case first :)
Sure, thanks for responding. The basic issue is that pg_query_scan sits over backend try / catch macros and aborts when it sees an error. This is fine for the backend implementation as it's basically processing one query at a time. I did try to whack around the bison component a bit and got nowhere; this is all black magic to me :-). This is all memorialized in PgQuerySplitResult with its single returned error. A partially returned parse state in the event of an error would also be wonderful, but I don't know if that's possible.
What's missing here is front-end pre-processing in the pg_query_scan interface: lex out the commands first, before running bison parsing, exactly as is done in libpq. Client-side parsing is extremely simple, offering only 4 parsing outcomes, with a tiny bit of complexity on the backslash commands to eat up the tokens.
See here; if a statement splitter could break up statements, this would allow for a more streaming style interface that only maintains context command to command. This could bolt on to the pg_query api with something like,
pg_query_get_command(const char * input) -> struct { command_type, length int }
command_type being one of backslash or query, more or less directly proxying the scan result interface. That could then be passed back into pg_query_parse (copying first to set the length for the local parse, since pg_query_parse does not take an input length), then repeating for the next command. There may be better ways to do this; I'm just framing the general need.
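To make the shape of that interface concrete, here is a rough, self-contained sketch in C. Everything in it is hypothetical: the names CommandType, Command, get_command and count_commands do not exist in libpg_query, and the splitting logic is a deliberately naive stand-in for psqlscan.l (it only understands quoted strings and -- comments, treats a backslash command as running to the end of its line, and ignores dollar quoting and /* */ comments). It is only meant to show the "one command at a time, with offsets" loop, not how the real lexer behaves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types: not part of libpg_query. */
typedef enum { CMD_EOF, CMD_QUERY, CMD_BACKSLASH } CommandType;

typedef struct {
    CommandType type;
    size_t start;     /* offset of the command's first byte in input */
    size_t length;    /* bytes in the command, terminator excluded */
    size_t consumed;  /* bytes to skip to reach the next command */
} Command;

/* Return the next command found in input (naive stand-in for psqlscan.l). */
Command get_command(const char *input)
{
    assert(input != NULL);
    Command cmd = { CMD_EOF, 0, 0, 0 };
    size_t i = 0;

    /* Skip whitespace between commands. */
    while (input[i] == ' ' || input[i] == '\t' ||
           input[i] == '\n' || input[i] == '\r')
        i++;
    cmd.start = i;
    if (input[i] == '\0') {
        cmd.consumed = i;
        return cmd;
    }

    if (input[i] == '\\') {
        /* Assumption: a backslash command runs to the end of its line. */
        cmd.type = CMD_BACKSLASH;
        while (input[i] != '\0' && input[i] != '\n')
            i++;
        cmd.length = i - cmd.start;
        cmd.consumed = i;
        return cmd;
    }

    /* Query: scan to the first semicolon outside quotes and -- comments. */
    cmd.type = CMD_QUERY;
    char quote = 0;
    for (; input[i] != '\0'; i++) {
        char c = input[i];
        if (quote) {
            if (c == quote)
                quote = 0;
        } else if (c == '\'' || c == '"') {
            quote = c;
        } else if (c == '-' && input[i + 1] == '-') {
            while (input[i + 1] != '\0' && input[i + 1] != '\n')
                i++;
        } else if (c == ';') {
            break;
        }
    }
    cmd.length = i - cmd.start;
    cmd.consumed = (input[i] == ';') ? i + 1 : i;
    return cmd;
}

/* Walk the whole input command by command, counting one command type. */
size_t count_commands(const char *input, CommandType wanted)
{
    size_t n = 0;
    for (;;) {
        Command cmd = get_command(input);
        if (cmd.type == CMD_EOF)
            break;
        if (cmd.type == wanted)
            n++;
        input += cmd.consumed;  /* only per-command context is kept */
    }
    return n;
}
```

Because the splitter never looks at SQL grammar, a statement the parser would reject (say, "SELEC 1;") is still returned as its own query command, and the caller can go on to the next one. Each returned slice would be copied out and NUL-terminated before being handed to pg_query_parse.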
Edit: adding a little more context on the general use case. LSP implementations will be very kneecapped, since they basically operate on a whole file but want to focus only on 'this' statement, the one the user is editing. Any parse failure earlier in the file would render that nonfunctional.
In order to get something like this compiled I cribbed a bit from pgbench. pgbench has a very similar use case: pull statements off a user-provided file, process metacommands locally if needed, and send the rest out. In pg_query's case, the 'rest out' would go to downstream parsing. Multi-statement input would then be at absolute parity with psql, which IMNSHO is a good thing :-).
In order to hand compile a simple statement splitter, I copied and pasted exprscan.c (itself generated from exprscan.l during standard compilation), then hacked up my own header file from a reduced pgbench.h, trimming out the various things not related to statement parsing. I also had to inline strtoint64 and strtodouble for some reason. This was compiled using the backend, not the frontend or extension, compiling environment. Basically a hacked-up and simplified pgbench.
I'm not very good with Make, so I'd need to study a bit how you'd get this into pg_query if that was the general inclination. The flex dependency might be a concern; not sure.