odbc
odbc copied to clipboard
Data gets truncated when special character example "ü" found while using Snowflake ODBC drivers with RStudio/R Terminal on Ubuntu 18.4/Centos 7.6/MacOS
Issue Description and Expected Result
Data gets truncated when special character example "ü" found while using Snowflake ODBC drivers with RStudio/R Terminal on Ubuntu 18.4/Centos 7.6/MacOS.
When accessing data via R-DBI via Snowflake ODBC driver the column "A" gets truncated (from Snowflake UI we can see that varchar(4) is returned as dataytpe and that result is complete). This issue is happening on Ubuntu 18.4/Centos 7.6/MacOS while using Snowflake ODBC drivers with RStudio/R Terminal.
As per the ODBC driver Manager log, both the column A and B return the correct data and its length in bytes, So this proves that the Snowflake ODBC driver is working as expected. We suspect the data truncation is happening at the R-Studio Application layer which is unable to display the data correctly.
ODBC driver Manager log:
[ODBC][23220][1587406470.655413][SQLGetData.c][233]
Entry:
Statement = 0x98f720
Column Number = 1
Target Type = 1 SQL_CHAR
Buffer Length = 301
Target Value = 0x7ffef14bacf0
StrLen Or Ind = 0x7ffef14baba0
[ODBC][23220][1587406470.655542][SQLGetData.c][497]
Exit:[SQL_SUCCESS]
Buffer = [grün]
Strlen Or Ind = 0x7ffef14baba0 -> 5
...
[ODBC][23220][1587406470.655694][SQLGetData.c][233]
Entry:
Statement = 0x98f720
Column Number = 2
Target Type = 1 SQL_CHAR
Buffer Length = 301
Target Value = 0x7ffef14bacf0
StrLen Or Ind = 0x7ffef14baba0
[ODBC][23220][1587406470.655756][SQLGetData.c][497]
Exit:[SQL_SUCCESS]
Buffer = [grün]
Strlen Or Ind = 0x7ffef14baba0 -> 5
We performed another test using the python code "pyodbc" and tried to connect the ODBC DSN and we see data is coming properly. So It tells us the issue is not related with the Snowflake ODBC driver. Looks like this issue lies on the R-Studio/R Terminal. Can you please advise here ?
Note : This issue is not seen while using Snowflake ODBC driver with Rstudio/R terminal on Windows 10 and with Snowsql client or Snowflake GUI.
Database
Create trial Snowflake account : https://trial.snowflake.com Download Snowflake ODBC Driver "2.21.3" : https://sfc-repo.snowflakecomputing.com/odbc/index.html Create ODBC DSN on Ubunutu 18.4 : https://docs.snowflake.com/en/user-guide/odbc-linux.html
Reproducible Example
Code used in R-Studio :
install.packages(c("DBI", "dplyr","dbplyr","odbc"))
library(DBI)
library(dplyr)
library(dbplyr)
library(odbc)
myconn <- DBI::dbConnect(odbc::odbc(),DBMSencoding = "UTF-8", "SNOWFLAKE_ODBC", uid="test", pwd='abcd@123')
mydata <- DBI::dbGetQuery(myconn,"SELECT 'grün' AS A,'grün'::text AS B")
head(mydata)
lapply(mydata, class)
odbc::odbcListDrivers()
odbc::odbcListDataSources()
sessionInfo()
Output: (Column A value gets truncated)
> head(mydata)
A B
1 grü grün
> lapply(mydata, class)
$A
[1] "character"
$B
[1] "character"
We verified that the Snowflake ODBC driver is working fine with Python Code on Ubuntu 18.4/Centos 7.6/MacOS.
To install pyodbc on system run below command.
> pip install pyodbc
Save the below code into a file name example "snowflakepyodbc.py" and try to run it .
import pyodbc
import sys
import os
con = pyodbc.connect('DSN=SNOWFLAKE_ODBC_DSN_NAME;UID=user1;PWD=Abc@123')
con.setencoding(encoding='utf-8')
con.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cursor=con.cursor()
cursor.execute("SELECT 'grün' AS A, 'grün'::text AS B")
while True:
row=cursor.fetchone()
if not row:
break
print(row)
Python Output: (Correct output)
A | B
grün | grün
Session Info
devtools::session_info()
#> output
It is most likely a Unicode character width mismatch between what R / odbc is expecting and what the snowflake driver is sending.
I doubt I will have time to reproduce this anytime in the near future, I don't really have the bandwidth to stand up a new database type.
As a note, when trying this with RODBC, it also delivers the correct result. It would be great if you could have a look, as Snowflake is getting quite popular these days and this looks like a DBI/odbc issue.
con_rodbc = RODBC::odbcDriverConnect(connection = glue::glue("driver=SnowflakeDSIIDriver;server={account}.snowflakecomputing.com;uid={user};pwd={password};warehouse={warehouse};role={role}"))
RODBC::sqlQuery(con_rodbc, "select 'grün' as A, 'grün'::text as B") A B 1 grün grün
Encountering quite similar problem when the type varchar(1)
of a column works perfecly on other applications (varchar(1)
is automatically specified by the compiler (not explicitly by the user) so it is should be supported by the official driver). But fails only here. Highly doubt that there is some unintended behavior when parsing the results given the type varchar
.
Hope this can be solved in the future.
odbc::dbGetQuery(
con, # snowflake odbc connection
"SELECT
'日'::varchar(1) AS dow_var1,
'日'::varchar(2) AS dow_var2,
'日'::varchar(3) AS dow_var3,
'日'::text AS dow_text
;"
)
# DOW_VAR1 DOW_VAR2 DOW_VAR3 DOW_TEXT
# 1 \xe6 \xe6\x97 日 日
After some tests by myself, I found that
日
, which is saved as varchar(1)
, now requires varchar(3)
type to be parsed correctly by this library.
And two chracsters, e.g. 日月
, need varchar(6)
to work here.
Mystery deepens...
Encountering quite similar problem when the type
varchar(1)
of a column works perfecly on other applications (varchar(1)
is automatically specified by the compiler (not explicitly by the user) so it is should be supported by the official driver). But fails only here. Highly doubt that there is some unintended behavior when parsing the results given the typevarchar
.Hope this can be solved in the future.
odbc::dbGetQuery( con, # snowflake odbc connection "SELECT '日'::varchar(1) AS dow_var1, '日'::varchar(2) AS dow_var2, '日'::varchar(3) AS dow_var3, '日'::text AS dow_text ;" ) # DOW_VAR1 DOW_VAR2 DOW_VAR3 DOW_TEXT # 1 \xe6 \xe6\x97 日 日
After some tests by myself, I found that
日
, which is saved asvarchar(1)
, now requiresvarchar(3)
type to be parsed correctly by this library. And two chracsters, e.g.日月
, needvarchar(6)
to work here.Mystery deepens...
related to #437
Thanks for the comment! I checked the thread #437 which kind of supported my findings on the snowflake official documents.
When you declare a column of type VARCHAR, you can specify an optional parameter (N), which is the maximum number of characters to store
The descrepency on recognizing the N
unit as characters
or bytes
is what caused the problem.
When a character actually contains multiple bytes, this happens.
UPDATE:
According to official documents,
adding MapToLongVarchar=0
to simba.snowflake.ini
will map all string data types to SQL_LONGVARCHAR
.
I tested this on the linux, and it solved the problem in the thread.
Though it's still unknown that:
- the performance loss. Given the document
Performance: There is no performance difference between using the full-length VARCHAR declaration VARCHAR(16777216) and a smaller length.
I'd assume there is no performance loss. But without a real performance test I couldn't be sure of this.
- what exactly is this
SQL_LONGVARCHAR
type?
I searched for a while and couldn't find it anywhere. I'd assume it to be TEXT
type, but no evidence.