datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

build(deps): upgrade sqlparser to 0.46.0

Open tisonkun opened this issue 1 year ago • 5 comments

Upgrade sqlparser to 0.46.0. This should handle a few of breaking changes:

  • [x] Insert and Delete are now encapsulated in a struct.
  • [ ] JsonAccess refactor - https://github.com/sqlparser-rs/sqlparser-rs/pull/1215/
  • [x] SQLFunction refactor - https://github.com/sqlparser-rs/sqlparser-rs/pull/1247
  • [x] Window Function refactor - https://github.com/sqlparser-rs/sqlparser-rs/pull/1237

This is working in progress.

tisonkun avatar May 06 '24 14:05 tisonkun

@jmhain Thanks for your patch. I've applied it to this branch. Let us continue fixing the breaking changes here when we have time :D

tisonkun avatar May 07 '24 15:05 tisonkun

Pushed a commit to handle other breaking changes. Now remains the JsonAccess and Agg func refactor. I'm still understanding how they are logically changed ...

tisonkun avatar May 08 '24 10:05 tisonkun

Seems that we should merge AggregateExpressionWithFilter and ArrayAgg handle logic into sql_function_to_expr.

tisonkun avatar May 14 '24 02:05 tisonkun

I am going to give this PR a look to see if I can help unstick it

alamb avatar May 14 '24 20:05 alamb

There is still a failure which I haven't had a chance to look into:

cargo test --test sqllogictests -- array
...
Running "array_query.slt"
Running "array.slt"
External error: query failed: DataFusion error: SQL error: ParserError("Expected variant object key name, found: 2")
[SQL] select make_array(1, 2, 3)[1:2], make_array(1.0, 2.0, 3.0)[2:3], make_array('h', 'e', 'l', 'l', 'o')[2:4];
at test_files/array.slt:859

Error: Execution("1 failures")
error: test failed, to rerun pass `-p datafusion-sqllogictest --test sqllogictests`

Caused by:
  process didn't exit successfully: `/Users/andrewlamb/Software/datafusion/target/debug/deps/sqllogictests-b3a0a4f302be541c array` (exit status: 1)

alamb avatar May 14 '24 20:05 alamb

I'm testing on @jmhain's branch https://github.com/sqlparser-rs/sqlparser-rs/pull/1290 on the subscript fixes and encounter the following new API breaks:

   Compiling datafusion-sql v38.0.0 (/Users/tison/Brittani/arrow-datafusion/datafusion/sql)
error[E0599]: no variant named `ArrayIndex` found for enum `sqlparser::ast::Expr`
   --> datafusion/sql/src/expr/mod.rs:197:22
    |
197 |             SQLExpr::ArrayIndex { obj, indexes } => {
    |                      ^^^^^^^^^^ variant not found in `sqlparser::ast::Expr`

error[E0599]: no variant or associated item named `RawStringLiteral` found for enum `sqlparser::ast::Value` in the current scope
  --> datafusion/sql/src/statement.rs:78:18
   |
78 |         | Value::RawStringLiteral(_)
   |                  ^^^^^^^^^^^^^^^^
   |                  |
   |                  variant or associated item not found in `Value`
   |                  help: there is a variant with a similar name: `HexStringLiteral`

error[E0026]: variant `sqlparser::ast::Statement::SetVariable` does not have a field named `variable`
   --> datafusion/sql/src/statement.rs:232:17
    |
232 |                 variable,
    |                 ^^^^^^^^
    |                 |
    |                 variant `sqlparser::ast::Statement::SetVariable` does not have this field
    |                 help: a field with a similar name exists: `variables`

   Compiling datafusion-functions-aggregate v38.0.0 (/Users/tison/Brittani/arrow-datafusion/datafusion/functions-aggregate)
error[E0004]: non-exhaustive patterns: `FunctionArgumentClause::Having(_)` and `FunctionArgumentClause::Separator(_)` not covered
    --> datafusion/sql/src/expr/function.rs:142:19
     |
142  |             match clause {
     |                   ^^^^^^ patterns `FunctionArgumentClause::Having(_)` and `FunctionArgumentClause::Separator(_)` not covered
     |
note: `FunctionArgumentClause` defined here
    --> /Users/tison/.cargo/git/checkouts/sqlparser-rs-7df4902af8c87ae3/bd573b2/src/ast/mod.rs:4995:1
     |
4995 | pub enum FunctionArgumentClause {
     | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
5023 |     Having(HavingBound),
     |     ------ not covered
...
5027 |     Separator(Value),
     |     --------- not covered
     = note: the matched value is of type `FunctionArgumentClause`
help: ensure that all possible cases are being handled by adding a match arm with a wildcard pattern, a match arm with multiple or-patterns as shown, or multiple match arms
     |
166  ~                 },
167  +                 FunctionArgumentClause::Having(_) | FunctionArgumentClause::Separator(_) => todo!()
     |

Also related to -

  • [x] https://github.com/sqlparser-rs/sqlparser-rs/pull/1262
  • [x] https://github.com/sqlparser-rs/sqlparser-rs/pull/1252
  • [x] https://github.com/sqlparser-rs/sqlparser-rs/pull/1258
  • [x] https://github.com/sqlparser-rs/sqlparser-rs/pull/1256

tisonkun avatar May 28 '24 22:05 tisonkun

I've handled all other trivial new ASTs. But how to migrate SQLExpr::ArrayIndex to SQLExpr::Subscript still need some time to work out. Perhaps @jmhain can take a look and patching this.

tisonkun avatar May 28 '24 22:05 tisonkun

Ensure all the sqllogictests passed without the array index cases:

Details
### Array index


## array[i]

# single index with scalars #1 (positive index)
query IRT
select make_array(1, 2, 3)[1], make_array(1.0, 2.0, 3.0)[2], make_array('h', 'e', 'l', 'l', 'o')[3];
----
1 2 l

# single index with scalars #2 (zero index)
query I
select make_array(1, 2, 3)[0];
----
NULL

# single index with scalars #3 (negative index)
query IRT
select make_array(1, 2, 3)[-1], make_array(1.0, 2.0, 3.0)[-2], make_array('h', 'e', 'l', 'l', 'o')[-3];
----
3 2 l

# single index with scalars #4 (complex index)
query IRT
select make_array(1, 2, 3)[1 + 2 - 1], make_array(1.0, 2.0, 3.0)[2 * 1 * 0 - 2], make_array('h', 'e', 'l', 'l', 'o')[2 - 3];
----
2 2 o

# single index with columns #1 (positive index)
query ?RT
select column1[2], column2[3], column3[1] from arrays;
----
[3, ] 3.3 L
[5, 6] 6.6 i
[7, 8] 9.9 d
[9, 10] 12.2 s
NULL 15.5 a
[13, 14] NULL ,
[, 18] 18.8 NULL

# single index with columns #2 (zero index)
query ?RT
select column1[0], column2[0], column3[0] from arrays;
----
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL

# single index with columns #3 (negative index)
query ?RT
select column1[-2], column2[-3], column3[-1] from arrays;
----
[, 2] 1.1 m
[3, 4] NULL m
[5, 6] 7.7 r
[7, ] 10.1 t
NULL 13.3 t
[11, 12] NULL ,
[15, 16] 16.6 NULL

# single index with columns #4 (complex index)
query ?RT
select column1[9 - 7], column2[2 * 0], column3[1 - 3] from arrays;
----
[3, ] NULL e
[5, 6] NULL u
[7, 8] NULL o
[9, 10] NULL i
NULL NULL e
[13, 14] NULL NULL
[, 18] NULL NULL

# TODO: support index as column
# single index with columns #5 (index as column)
# query ?
# select make_array(1, 2, 3, 4, 5)[column2] from arrays_with_repeating_elements;
# ----

# TODO: support argument and index as columns
# single index with columns #6 (argument and index as columns)
# query I
# select column1[column2] from arrays_with_repeating_elements;
# ----

## array[i:j]

# multiple index with columns #1 (positive index)
query ???
select make_array(1, 2, 3)[1:2], make_array(1.0, 2.0, 3.0)[2:3], make_array('h', 'e', 'l', 'l', 'o')[2:4];
----
[1, 2] [2.0, 3.0] [e, l, l]

query ???
select arrow_cast([1, 2, 3], 'LargeList(Int64)')[1:2],
       arrow_cast([1.0, 2.0, 3.0], 'LargeList(Int64)')[2:3],
       arrow_cast(['h', 'e', 'l', 'l', 'o'], 'LargeList(Utf8)')[2:4]
;
----
[1, 2] [2, 3] [e, l, l]

# multiple index with columns #2 (zero index)
query ???
select make_array(1, 2, 3)[0:0], make_array(1.0, 2.0, 3.0)[0:2], make_array('h', 'e', 'l', 'l', 'o')[0:6];
----
[] [1.0, 2.0] [h, e, l, l, o]

query ???
select arrow_cast([1, 2, 3], 'LargeList(Int64)')[0:0],
       arrow_cast([1.0, 2.0, 3.0], 'LargeList(Int64)')[0:2],
       arrow_cast(['h', 'e', 'l', 'l', 'o'], 'LargeList(Utf8)')[0:6]
;
----
[] [1, 2] [h, e, l, l, o]

query I
select arrow_cast([1, 2, 3], 'LargeList(Int64)')[1];
----
1

# TODO: support multiple negative index
# multiple index with columns #3 (negative index)
# query II
# select make_array(1, 2, 3)[-3:-1], make_array(1.0, 2.0, 3.0)[-3:-1], make_array('h', 'e', 'l', 'l', 'o')[-2:0];
# ----

# TODO: support complex index
# multiple index with columns #4 (complex index)
# query III
# select make_array(1, 2, 3)[2 + 1 - 1:10], make_array(1.0, 2.0, 3.0)[2 | 2:10], make_array('h', 'e', 'l', 'l', 'o')[6 ^ 6:10];
# ----

# multiple index with columns #1 (positive index)
query ???
select column1[2:4], column2[1:4], column3[3:4] from arrays;
----
[[3, ]] [1.1, 2.2, 3.3] [r, e]
[[5, 6]] [, 5.5, 6.6] [, u]
[[7, 8]] [7.7, 8.8, 9.9] [l, o]
[[9, 10]] [10.1, , 12.2] [t]
[] [13.3, 14.4, 15.5] [e, t]
[[13, 14]] [] []
[[, 18]] [16.6, 17.7, 18.8] []

# multiple index with columns #2 (zero index)
query ???
select column1[0:5], column2[0:3], column3[0:9] from arrays;
----
[[, 2], [3, ]] [1.1, 2.2, 3.3] [L, o, r, e, m]
[[3, 4], [5, 6]] [, 5.5, 6.6] [i, p, , u, m]
[[5, 6], [7, 8]] [7.7, 8.8, 9.9] [d, , l, o, r]
[[7, ], [9, 10]] [10.1, , 12.2] [s, i, t]
[] [13.3, 14.4, 15.5] [a, m, e, t]
[[11, 12], [13, 14]] [] [,]
[[15, 16], [, 18]] [16.6, 17.7, 18.8] []

# TODO: support negative index
# multiple index with columns #3 (negative index)
# query ?RT
# select column1[-2:-4], column2[-3:-5], column3[-1:-4] from arrays;
# ----
# [, 2] 1.1 m

# TODO: support complex index
# multiple index with columns #4 (complex index)
# query ?RT
# select column1[9 - 7:2 + 2], column2[1 * 0:2 * 3], column3[1 + 1 - 0:5 % 3] from arrays;
# ----

# TODO: support first index as column
# multiple index with columns #5 (first index as column)
# query ?
# select make_array(1, 2, 3, 4, 5)[column2:4] from arrays_with_repeating_elements
# ----

# TODO: support last index as column
# multiple index with columns #6 (last index as column)
# query ?RT
# select make_array(1, 2, 3, 4, 5)[2:column3] from arrays_with_repeating_elements;
# ----

# TODO: support argument and indices as column
# multiple index with columns #7 (argument and indices as column)
# query ?RT
# select column1[column2:column3] from arrays_with_repeating_elements;
# ----

# array[i:j:k]

# multiple index with columns #1 (positive index)
query ???
select make_array(1, 2, 3)[1:2:2], make_array(1.0, 2.0, 3.0)[2:3:2], make_array('h', 'e', 'l', 'l', 'o')[2:4:2];
----
[1] [2.0] [e, l]

# multiple index with columns #2 (zero index)
query ???
select make_array(1, 2, 3)[0:0:2], make_array(1.0, 2.0, 3.0)[0:2:2], make_array('h', 'e', 'l', 'l', 'o')[0:6:2];
----
[] [1.0] [h, l, o]

#TODO: sqlparser does not support negative index
## multiple index with columns #3 (negative index)
#query ???
#select make_array(1, 2, 3)[-1:-2:-2], make_array(1.0, 2.0, 3.0)[-2:-3:-2], make_array('h', 'e', 'l', 'l', 'o')[-2:-4:-2];
#----
#[1] [2.0] [e, l]

# multiple index with columns #1 (positive index)
query ???
select column1[2:4:2], column2[1:4:2], column3[3:4:2] from arrays;
----
[[3, ]] [1.1, 3.3] [r]
[[5, 6]] [, 6.6] []
[[7, 8]] [7.7, 9.9] [l]
[[9, 10]] [10.1, 12.2] [t]
[] [13.3, 15.5] [e]
[[13, 14]] [] []
[[, 18]] [16.6, 18.8] []

# multiple index with columns #2 (zero index)
query ???
select column1[0:5:2], column2[0:3:2], column3[0:9:2] from arrays;
----
[[, 2]] [1.1, 3.3] [L, r, m]
[[3, 4]] [, 6.6] [i, , m]
[[5, 6]] [7.7, 9.9] [d, l, r]
[[7, ]] [10.1, 12.2] [s, t]
[] [13.3, 15.5] [a, e]
[[11, 12]] [] [,]
[[15, 16]] [16.6, 18.8] []

tisonkun avatar May 28 '24 23:05 tisonkun

Thanks @tisonkun -- I plan to review this and other sqlparser stuff tomorrow

alamb avatar May 30 '24 00:05 alamb

I have been playing around with this PR locally. I think we need to make some additional changes to how SQL planning works for these structured field accesses, but I think it will be simpler when we are done. I'll push some changes shortly

I also see what else is needed upstream in sqlparser and I'll try and add it to @jmhain 's branch too

alamb avatar May 30 '24 13:05 alamb

I updated went and updated @jmhain 's PR https://github.com/sqlparser-rs/sqlparser-rs/pull/1290 to support the stride argument https://github.com/sqlparser-rs/sqlparser-rs/pull/1290#issuecomment-2139994033

I then pushed a commit to this branch to rework how field access is planned. Let's see how the CI does

alamb avatar May 30 '24 16:05 alamb

🎉 the tests are now passing

alamb avatar May 30 '24 17:05 alamb

@alamb Thank you and cool! I'll update this PR once sqlparser released a 0.47.

tisonkun avatar May 30 '24 23:05 tisonkun

I tried to update this PR to https://github.com/sqlparser-rs/sqlparser-rs/commit/afa5f08db9b1f3a4805f21fea6b1e72710cdb138 (the latest and greatest commit from sqlparser-rs) but the recent rework of the function definition seem to cause issues. I need to look at that again in more detail

alamb avatar May 31 '24 21:05 alamb

@alamb there is even a new breaking change at https://github.com/sqlparser-rs/sqlparser-rs/pull/1272:

error[E0277]: the trait bound `Arc<str>: From<Box<sqlparser::ast::Expr>>` is not satisfied
   --> datafusion/sql/src/expr/mod.rs:578:74
    |
578 |                 DataType::Timestamp(TimeUnit::Nanosecond, Some(time_zone.into())),
    |                                                                          ^^^^ the trait `From<Box<sqlparser::ast::Expr>>` is not implemented for `Arc<str>`, which is required by `Box<sqlparser::ast::Expr>: Into<_>`
    |
    = help: the following other types implement trait `From<T>`:
              <Arc<str> as From<&str>>
              <Arc<str> as From<std::string::String>>
    = note: required for `Box<sqlparser::ast::Expr>` to implement `Into<Arc<str>>`

I'm worried about downstream projects to upgrade datafusion now 🤣

tisonkun avatar Jun 01 '24 07:06 tisonkun

Pushed some commit to catch up new sqlparser-rs commits. Let's see.

tisonkun avatar Jun 01 '24 07:06 tisonkun

Since we're on the sqlparser-rs main branch now, I'd consider this patch as ready to merge.

But of course it's better to have a new release to sqlparser-rs and we move to that version. Hopefully it contains what it has now without new breaking changes :P

tisonkun avatar Jun 01 '24 07:06 tisonkun

Window specific test failed. I don't have a windows env to debug ...

test cli_quick_test::case_4_set_batch_size ... FAILED

failures:

---- cli_quick_test::case_4_set_batch_size stdout ----
thread 'cli_quick_test::case_4_set_batch_size' panicked at /rustc/9b00956e56009bab2aa15d7bff10916599e3d6d6\library\core\src\ops\function.rs:250:5:
Unexpected stdout, failed var == "[{\"name\":\"datafusion.execution.batch_size\",\"value\":\"1\"}]\n"
├── var: ""
└── var as str: 

command=`"D:\\a\\datafusion\\datafusion\\datafusion-cli\\target\\debug\\datafusion-cli.exe" "--command" "show datafusion.execution.batch_size" "--format" "json" "-q" "-b" "1"`
code=-1073741571
stdout=""
stderr=```

thread \'main\' has overflowed its stack

tisonkun avatar Jun 01 '24 08:06 tisonkun

Since we're on the sqlparser-rs main branch now, I'd consider this patch as ready to merge.

But of course it's better to have a new release to sqlparser-rs and we move to that version. Hopefully it contains what it has now without new breaking changes :P

Thank you so much @tisonkun -- I agree

I will go make a sqlparser release now without any new changes

alamb avatar Jun 01 '24 09:06 alamb

Thanks @tisonkun -- I updated this PR to use the newly released 0.47.0 version of sqlparser-rs and updated the description. I suppose now we just need to debug the windows test (I have an idea) and we'll be good to go

alamb avatar Jun 01 '24 10:06 alamb

My plan for this PR is to disable the failing test on windows and file a ticket to investigate

alamb avatar Jun 03 '24 13:06 alamb

@alamb Thanks for your review and follow-up! @jmhain Thanks for your help :D

tisonkun avatar Jun 04 '24 22:06 tisonkun

I did a final merge up from main to make sure we don't have any logical conflicts and then I plan to merge this PR

alamb avatar Jun 06 '24 12:06 alamb