COPY TO/FROM is overly restrictive on options
Current parsing of COPY TO/FROM options is too restrictive on the how the keys & values can be provided.
https://github.com/sqlparser-rs/sqlparser-rs/blob/ce498864dc705f72e9f85d4dc5f7eba3d17b9ef6/src/parser/mod.rs#L5157-L5180
Given this parsing code, the expected format is:
COPY (SELECT 1 AS a, 2 AS b) TO 'file'
WITH (
FORMAT 'text',
FREEZE true,
DELIMITER 'd',
NULL 'null',
HEADER true,
QUOTE 'q',
ESCAPE 'e',
FORCE_QUOTE (a, b),
FORCE_NOT_NULL (a, b),
FORCE_NULL (a, b),
ENCODING 'encoding'
)
The problems:
- Key must be a keyword and cannot be a quoted string, even though postgres allows this, i.e.
WITH ("format" 'text', ...) - It strictly limits the key set at parsing time, leading to further problems:
- Need to update this parser whenever postgres supports a new key, e.g. DEFAULT (supported in postgres 16: https://www.postgresql.org/docs/16/sql-copy.html)
- We can't use this code for other databases/engines with a similar copy statement as they may have different supported keys (e.g. duckdb and datafusion, see #1080)
- Some values are too strict in their format:
- FORMAT expects an identifier, which disallows escaped strings (
format e'csv'), but this is allowed by postgres - FREEZE expects keywords
trueorfalsebut postgres allows these to be passed as string literals (single quoted, double quoted, escaped), e.g.freeze 'true'- Same for HEADER
- FORMAT expects an identifier, which disallows escaped strings (
Solutions
We could fix these problems for Postgres specifically to make it more permissive, but I was hoping to also fix it to allow usage by other engine dialects (e.g. duckdb and datafusion). This might involve removing the CopyOption enum entirely:
https://github.com/sqlparser-rs/sqlparser-rs/blob/ce498864dc705f72e9f85d4dc5f7eba3d17b9ef6/src/ast/mod.rs#L4707-L4736
And replacing its usage in Statement::Copy with a more generic version that allows any string key with a new value enum to represent different values (e.g. string value, parenthesized column list value, number value):
https://github.com/sqlparser-rs/sqlparser-rs/blob/ce498864dc705f72e9f85d4dc5f7eba3d17b9ef6/src/ast/mod.rs#L1466-L1478
I'm not sure how the legacy options will fit into this, it could just be left as is and probably tackled as part of another ticket (as the original reason here was for interop with datafusion copy statement)
I'm currently working on this, but I wanted to gather opinions on how to properly tackle this, especially in regards to how much backward compatibility (if any) we should strive to preserve. I was initially thinking of removing CopyOption enum entirely as it feels problematic, but it would be a breaking change.
Thoughts @alamb ?
Rather than removing CopyOption entirely, what about making an extension or other variant in CopyOption that is like CopyOption::Other(Name, Value) or soemthing?
Any changes like this will be a breaking change -- I think a key requirement is that sqlparser continues to be able to parse queries it is currently able to (doens't start rejecting queries it used to parse)
I think a very valuable property is that the AST that comes out is as typed as possible -- e.g. Copy options probably shouldn't just be a generic set of key/value pairs
Another potential thought might be to keep the current postgres specific copy options but rename it to PostgresCopy and only parse with the postgres dialect, and then make a more generic node for the Generic dialect (and we can add a DuckDB specific one too if we wanted)
I think a key requirement is that sqlparser continues to be able to parse queries it is currently able to (doens't start rejecting queries it used to parse)
This should be fine. In fact, it should be more permissive now. It's more about library API breaking changes.
Rather than removing
CopyOptionentirely, what about making an extension or other variant inCopyOptionthat is likeCopyOption::Other(Name, Value)or soemthing?
Yeah this is a possibility, as downstream users can then ignore the Postgres specific enum variants (though might be undesirable to have this baggage).
I think a very valuable property is that the AST that comes out is as typed as possible -- e.g. Copy options probably shouldn't just be a generic set of key/value pairs
I agree that more typing is better, but in this case the option keyset seems less part of the SQL syntax and more part of engine semantics. Although, postgres does throw an error when it doesn't recognize an option with the error pointing to the SQL so maybe I'm wrong on that:
postgres=# copy (select 1 as a, 2 as b, 3 as c) to '/tmp/123' with (format e'csv', delimiter a, freeze e'true', null test, test123 test);
ERROR: option "test123" not recognized
LINE 1: ...t e'csv', delimiter a, freeze e'true', null test, test123 te...
- Also having the keyset be unrestricted/generic and not typed to enum might make it easier for downstream users to support more/new keys (since they can implement the key/value parsing themselves), but perhaps this doesn't change often enough to warrant being a high priority
Another potential thought might be to keep the current postgres specific copy options but rename it to
PostgresCopyand only parse with the postgres dialect, and then make a more generic node for the Generic dialect (and we can add a DuckDB specific one too if we wanted)
This is probably the better approach, so users won't have to consider postgres specific options when they are checking the copy options. I'll probably explore in this direction.