
Avoid querying column types from the database in dbWriteTable() and dbAppendTable()

Open krlmlr opened this issue 4 years ago • 9 comments

Issue Description and Expected Result

dbWriteTable() on SQL Server 12.xx fails with BLOB columns if the table name is given in quotes. (I spent too much time isolating this.)

I suspect we can't make it work with quotes; this must be some SQL Server quirk. Unfortunately, this also means that dplyr::copy_to() fails in these cases. Advice found online suggests using INSERT INTO xxx VALUES (CONVERT(varbinary(max), ?)) .
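As a rough sketch of what that workaround could look like through DBI (using the con and data objects from the reprex below; whether FreeTDS binds the blob parameter cleanly this way is an open question):

# Hedged sketch: wrap the bound blob parameter in CONVERT() so SQL Server
# casts it to varbinary(max) itself instead of attempting an implicit conversion.
rs <- DBI::dbSendStatement(
  con,
  "INSERT INTO [blob_test] (a, b) VALUES (?, CONVERT(varbinary(max), ?))"
)
DBI::dbBind(rs, list(data$a, data$b))
DBI::dbClearResult(rs)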

Should we introduce a new argument to dbWriteTable() and dbAppendTable() that supports custom conversions for the input? Are there other use cases besides this particular problem?

Database

SQL Server 12.xx (doesn't happen on SQL Server 15.xx on Ubuntu).

Reproducible Example

library(DBI)

drv <- odbc::odbc()

uid <- Sys.getenv("...")
pwd <- Sys.getenv("...")
con <- dbConnect(
  drv,
  dsn = "dwh",
  uid = uid, pwd = pwd, encoding = "utf-8"
)

con
#> <OdbcConnection> SV0118
#>   Database: master
#>   Microsoft SQL Server Version: 12.00.6164

data <- data.frame(a = 1:3)
data$b <- blob::as_blob(lapply(lapply(1:3, runif), serialize, connection = NULL))

dbWriteTable(con, "blob_test", data)
identical(dbReadTable(con, "blob_test"), data)
#> [1] TRUE
dbRemoveTable(con, "blob_test")

dbWriteTable(con, SQL("[blob_test]"), data)
#> Error in result_insert_dataframe(rs@ptr, values, batch_rows): nanodbc/nanodbc.cpp:1655: 00000: [FreeTDS][SQL Server]Implicit conversion from data type varchar to varbinary is not allowed. Use the CONVERT function to run this query.  [FreeTDS][SQL Server]Statement(s) could not be prepared.
identical(dbReadTable(con, "blob_test"), data)
#> [1] FALSE
dbRemoveTable(con, "blob_test")

Created on 2021-06-11 by the reprex package (v2.0.0)

krlmlr avatar Jun 11 '21 04:06 krlmlr

Hey @krlmlr

Is this FreeTDS-specific? Everything below assumes that the answer is yes - feel free to ignore otherwise.

The part that may be failing is where we send a description of the table's column types ahead to the server (this is needed because the ODBC API implementation in FreeTDS is incomplete):

https://github.com/r-dbi/odbc/blob/main/R/Table.R#L70

In particular, I would look at how odbcConnectionColumns(conn, name) behaves when name is quoted as in your example. If that is indeed the piece that's failing (should be easy to verify with your code), then the solution should be straightforward - either here or in dplyr.
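For example, something along these lines (reusing the connection and table from the reprex) should show whether the quoted name breaks the catalog lookup; this is only an illustration of the check, not code from the package:

# Compare the catalog lookup for the plain and the quoted table name.
# If the quoted variant returns zero rows, the send-ahead column
# description is the piece that fails.
odbc::odbcConnectionColumns(con, "blob_test")
odbc::odbcConnectionColumns(con, DBI::SQL("[blob_test]"))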

Cheers

detule avatar Jun 26 '21 16:06 detule

Yes, it's FreeTDS. I'm not sure what a dbplyr solution looks like -- column names can be reserved words. Can you propose a solution for the odbc package?

krlmlr avatar Jun 26 '21 18:06 krlmlr

Heya @krlmlr

I think it would be useful to confirm whether the issue is related to odbcConnectionColumns not giving you a coherent result when A. the table exists, and B. the (table) name argument is quoted as in your example.

If so, the next step would be to test whether fixing odbcConnectionColumns() (it's an S4 method, so we would need to write an implementation that covers the quoted-table-name case) fixes the overall issue. If it does, then we can roll that fix into odbc, I would think.
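A minimal sketch of what such a method could look like (the dispatch signature and the use of DBI::dbUnquoteIdentifier() are assumptions for illustration, not how odbc actually resolves names):

# Hypothetical S4 method: strip the SQL quoting from the name, then
# delegate to the existing lookup for unquoted identifiers.
setMethod("odbcConnectionColumns", c("OdbcConnection", "SQL"),
  function(conn, name, ...) {
    unquoted <- DBI::dbUnquoteIdentifier(conn, name)[[1]]
    odbcConnectionColumns(conn, unquoted, ...)
  }
)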

Cheers

detule avatar Jun 27 '21 16:06 detule

Coming back here, I don't understand why dbWriteTable() calls odbcConnectionColumns() in the first place. I see that we need four columns from the output to call describe_parameters() for optimal input conversion, but I don't see why we can't get this data from the input itself.

@jimhester, anyone: Could you please shed some light on the reason? I'd rather avoid the brittleness of calling odbcConnectionColumns() in dbWriteTable(), but there might be good reasons I'm not aware of yet.

krlmlr avatar Aug 24 '22 12:08 krlmlr

That behavior was added in https://github.com/r-dbi/odbc/pull/313, @detule would likely be the best person to ask.

jimhester avatar Aug 24 '22 12:08 jimhester

I think with odbcConnectionColumns + describe_parameters we fixed an issue with VARCHAR columns being truncated on INSERT with FreeTDS. From what I can tell, SQLDescribeParam is still not implemented in FreeTDS, so that particular challenge has likely not changed since then.

I don't follow how we can get column descriptions from the input, but that sounds interesting.

detule avatar Aug 24 '22 13:08 detule

Thanks.

Specifically, we need:

  • data_type
  • column_size
  • decimal_digits

for each input parameter. All these values are integers; data_type appears to be the ODBC-specific type code, see e.g. https://www.easysoft.com/developer/languages/c/examples/ListDataTypes.html .

With #313, dbWriteTable() attempts to get these from the database's catalog information, which can fail in many different ways, and there is no fallback. I'm proposing to infer these values from the data instead. We would need a mapping from R type to ODBC type; this already exists in the reverse direction in odbc_result::column_types(). The column_size and decimal_digits would also be inferred from the data.
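A sketch of that inference, working purely from the data frame (the type codes and sizing rules below are illustrative assumptions, not the mapping odbc actually uses):

# Illustrative only: derive data_type / column_size / decimal_digits
# from an R column instead of querying the database catalog.
infer_param_desc <- function(col) {
  if (inherits(col, "blob") || is.list(col)) {
    list(data_type = -3L,                      # SQL_VARBINARY
         column_size = max(lengths(col), 1L),
         decimal_digits = 0L)
  } else if (is.character(col)) {
    list(data_type = 12L,                      # SQL_VARCHAR
         column_size = max(nchar(col, type = "bytes"), 1L, na.rm = TRUE),
         decimal_digits = 0L)
  } else if (is.integer(col)) {
    list(data_type = 4L, column_size = 10L, decimal_digits = 0L)   # SQL_INTEGER
  } else if (is.numeric(col)) {
    list(data_type = 8L, column_size = 15L, decimal_digits = 0L)   # SQL_DOUBLE
  } else {
    stop("unhandled column type: ", class(col)[[1]])
  }
}

lapply(data, infer_param_desc)  # one description per column of the reprex data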

krlmlr avatar Aug 24 '22 13:08 krlmlr

Interesting @krlmlr

For variable-sized columns, would you then iterate through all elements to get a sense of the maximum field length? There might be some performance trade-offs there.

Back to your original issue - I am still not sure we have a good sense of what is happening.

  • Does the table exist before the first write call? What is the table description? After the first write but before the first remove, what is the output of odbcConnectionColumns() with and without quoting the table name?
  • After the second write but before the second remove, what is the output of odbcConnectionColumns() with and without quoting the table name?

Cheers

detule avatar Aug 24 '22 14:08 detule

We're iterating through all elements anyway when writing a table or iterating over all parameters.

For the original issue, I'm convinced that odbcConnectionColumns() is too brittle. For instance, it doesn't seem to help with temporary tables on SQL Server, which is why I'd like to avoid it altogether.

I can open a new issue and close this.

krlmlr avatar Aug 24 '22 14:08 krlmlr

Hi @krlmlr

I would like to wrap up this issue and, if possible, do it with a light touch. Ultimately, the root cause of some of these issues is the incomplete ODBC API implementation in FreeTDS, and if we can avoid adding complexity to odbc to compensate for that deficiency in FreeTDS, I would count that as a win.

Can you see if this solution works for the issue / example in the OP? If so, I'll clean it up / add some unit tests and see if we can get it merged.

Copy that there may be other issues with SQL Server and temporary tables; I believe we can address them elsewhere (similarly, with a light touch).

Thanks

detule avatar Dec 16 '22 02:12 detule

ping @krlmlr

Had a chance to test the reprex in the OP against SQL Server with FreeTDS, and it seems like we can move past it:

> dbRemoveTable(conn, "blob_test")
> identical(DBI::dbReadTable(conn, "blob_test"), data)
[1] TRUE
> dbRemoveTable(conn, "blob_test")
> dbWriteTable(conn, DBI::SQL("[blob_test]"), data)
> identical(DBI::dbReadTable(conn, "blob_test"), data)
[1] TRUE
> dbGetInfo(conn)$drivername
[1] "libtdsodbc.so"
> dbGetInfo(conn)$driver.version
[1] "01.01.9999"

Let me know if you see the same.

Thanks

detule avatar Dec 18 '22 15:12 detule

Thanks. I still don't understand why we need to query column types from the database before writing. But if we must, the proposed change looks good.

krlmlr avatar Dec 18 '22 17:12 krlmlr

Thanks @krlmlr

Appreciate it.

FWIW, I don't think we have to go the route of getting table/column information from the back-end; however, I do think it saves us from needing to do more parsing of the input data and/or dipping our toes into back-end data/type definitions. Either way, IMO, that would be onboarding more complexity into package:odbc.

Let us pull that lever if we find that we really need to.

detule avatar Dec 28 '22 12:12 detule

From experience with other backends, "parsing of input data" or "back-end data/type definitions" were never an issue. I don't know enough about ODBC, though -- could we map each R data type to the same ODBC data representation? If this is possible, I believe it would reduce complexity in odbc.
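As a sketch of what I mean (type codes taken from the ODBC sql.h header; whether a fixed, content-independent mapping is workable for every back-end is exactly the open question):

# Hypothetical one-to-one mapping from R storage type to an ODBC type code,
# independent of the data's contents.
r_to_odbc <- c(
  logical   = -7L,   # SQL_BIT
  integer   =  4L,   # SQL_INTEGER
  double    =  8L,   # SQL_DOUBLE
  character = 12L,   # SQL_VARCHAR
  list      = -3L    # SQL_VARBINARY, e.g. blob columns
)
r_to_odbc[[typeof(data$b)]]  # blob column from the reprex maps to SQL_VARBINARY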

krlmlr avatar Jan 08 '23 13:01 krlmlr