odbc icon indicating copy to clipboard operation
odbc copied to clipboard

Data gets truncated when special character example "ü" found while using Snowflake ODBC drivers with RStudio/R Terminal on Ubuntu 18.4/Centos 7.6/MacOS

Open krishnavns opened this issue 4 years ago • 2 comments

Issue Description and Expected Result

Data gets truncated when special character example "ü" found while using Snowflake ODBC drivers with RStudio/R Terminal on Ubuntu 18.4/Centos 7.6/MacOS.

When accessing data via R-DBI via Snowflake ODBC driver the column "A" gets truncated (from Snowflake UI we can see that varchar(4) is returned as dataytpe and that result is complete). This issue is happening on Ubuntu 18.4/Centos 7.6/MacOS while using Snowflake ODBC drivers with RStudio/R Terminal.

As per the ODBC driver Manager log, both the column A and B return the correct data and its length in bytes, So this proves that the Snowflake ODBC driver is working as expected. We suspect the data truncation is happening at the R-Studio Application layer which is unable to display the data correctly.

ODBC driver Manager log:

[ODBC][23220][1587406470.655413][SQLGetData.c][233]
		Entry:            
			Statement = 0x98f720            
			Column Number = 1            
			Target Type = 1 SQL_CHAR            
			Buffer Length = 301            
			Target Value = 0x7ffef14bacf0            
			StrLen Or Ind = 0x7ffef14baba0
[ODBC][23220][1587406470.655542][SQLGetData.c][497]
		Exit:[SQL_SUCCESS]                
			Buffer = [grün]                
			Strlen Or Ind = 0x7ffef14baba0 -> 5
			...
[ODBC][23220][1587406470.655694][SQLGetData.c][233]
		Entry:            
			Statement = 0x98f720            
			Column Number = 2            
			Target Type = 1 SQL_CHAR            
			Buffer Length = 301            
			Target Value = 0x7ffef14bacf0            
			StrLen Or Ind = 0x7ffef14baba0
[ODBC][23220][1587406470.655756][SQLGetData.c][497]
		Exit:[SQL_SUCCESS]                
			Buffer = [grün]                
			Strlen Or Ind = 0x7ffef14baba0 -> 5

We performed another test using the python code "pyodbc" and tried to connect the ODBC DSN and we see data is coming properly. So It tells us the issue is not related with the Snowflake ODBC driver. Looks like this issue lies on the R-Studio/R Terminal. Can you please advise here ?

Note : This issue is not seen while using Snowflake ODBC driver with Rstudio/R terminal on Windows 10 and with Snowsql client or Snowflake GUI.

Database

Create trial Snowflake account : https://trial.snowflake.com Download Snowflake ODBC Driver "2.21.3" : https://sfc-repo.snowflakecomputing.com/odbc/index.html Create ODBC DSN on Ubunutu 18.4 : https://docs.snowflake.com/en/user-guide/odbc-linux.html

Reproducible Example

Code used in R-Studio :

install.packages(c("DBI", "dplyr","dbplyr","odbc"))
library(DBI)
library(dplyr)
library(dbplyr)
library(odbc)
myconn <- DBI::dbConnect(odbc::odbc(),DBMSencoding = "UTF-8", "SNOWFLAKE_ODBC", uid="test", pwd='abcd@123')
mydata <- DBI::dbGetQuery(myconn,"SELECT 'grün' AS A,'grün'::text AS B")
head(mydata)
lapply(mydata, class)
odbc::odbcListDrivers()
odbc::odbcListDataSources()
sessionInfo()

Output: (Column A value gets truncated)

> head(mydata)

   A    B

1 grü  grün

> lapply(mydata, class)
$A
[1] "character"

$B
[1] "character"

We verified that the Snowflake ODBC driver is working fine with Python Code on Ubuntu 18.4/Centos 7.6/MacOS.

To install pyodbc on system run below command.

> pip install pyodbc

Save the below code into a file name example "snowflakepyodbc.py" and try to run it .

import pyodbc
import sys
import os

con = pyodbc.connect('DSN=SNOWFLAKE_ODBC_DSN_NAME;UID=user1;PWD=Abc@123')
con.setencoding(encoding='utf-8')
con.setdecoding(pyodbc.SQL_CHAR, encoding='utf-8')
cursor=con.cursor()
cursor.execute("SELECT 'grün' AS A, 'grün'::text AS B")

while True:
        row=cursor.fetchone()                                                                                                                                               
        if not row:                                                                                                                                                         
                break                                                                                                                                                       
        print(row)

Python Output: (Correct output)

A | B
grün | grün
Session Info
devtools::session_info()
#> output

krishnavns avatar Jun 04 '20 08:06 krishnavns

It is most likely a Unicode character width mismatch between what R / odbc is expecting and what the snowflake driver is sending.

I doubt I will have time to reproduce this anytime in the near future, I don't really have the bandwidth to stand up a new database type.

jimhester avatar Jun 05 '20 14:06 jimhester

As a note, when trying this with RODBC, it also delivers the correct result. It would be great if you could have a look, as Snowflake is getting quite popular these days and this looks like a DBI/odbc issue.

con_rodbc = RODBC::odbcDriverConnect(connection = glue::glue("driver=SnowflakeDSIIDriver;server={account}.snowflakecomputing.com;uid={user};pwd={password};warehouse={warehouse};role={role}"))

RODBC::sqlQuery(con_rodbc, "select 'grün' as A, 'grün'::text as B") A B 1 grün grün

euklid321 avatar Jun 22 '20 08:06 euklid321

Encountering quite similar problem when the type varchar(1) of a column works perfecly on other applications (varchar(1) is automatically specified by the compiler (not explicitly by the user) so it is should be supported by the official driver). But fails only here. Highly doubt that there is some unintended behavior when parsing the results given the type varchar.

Hope this can be solved in the future.

odbc::dbGetQuery(
  con,  # snowflake odbc connection
  "SELECT 
    '日'::varchar(1) AS dow_var1, 
    '日'::varchar(2) AS dow_var2,
    '日'::varchar(3) AS dow_var3,
    '日'::text AS dow_text
  ;"
)

#  DOW_VAR1 DOW_VAR2 DOW_VAR3 DOW_TEXT
# 1     \xe6 \xe6\x97       日       日

After some tests by myself, I found that , which is saved as varchar(1), now requires varchar(3) type to be parsed correctly by this library. And two chracsters, e.g. 日月, need varchar(6) to work here.

Mystery deepens...

yuhenghuang avatar Jan 04 '23 07:01 yuhenghuang

Encountering quite similar problem when the type varchar(1) of a column works perfecly on other applications (varchar(1) is automatically specified by the compiler (not explicitly by the user) so it is should be supported by the official driver). But fails only here. Highly doubt that there is some unintended behavior when parsing the results given the type varchar.

Hope this can be solved in the future.

odbc::dbGetQuery(
  con,  # snowflake odbc connection
  "SELECT 
    '日'::varchar(1) AS dow_var1, 
    '日'::varchar(2) AS dow_var2,
    '日'::varchar(3) AS dow_var3,
    '日'::text AS dow_text
  ;"
)

#  DOW_VAR1 DOW_VAR2 DOW_VAR3 DOW_TEXT
# 1     \xe6 \xe6\x97       日       日

After some tests by myself, I found that , which is saved as varchar(1), now requires varchar(3) type to be parsed correctly by this library. And two chracsters, e.g. 日月, need varchar(6) to work here.

Mystery deepens...

related to #437

shrektan avatar Jan 04 '23 14:01 shrektan

Thanks for the comment! I checked the thread #437 which kind of supported my findings on the snowflake official documents.

When you declare a column of type VARCHAR, you can specify an optional parameter (N), which is the maximum number of characters to store

The descrepency on recognizing the N unit as characters or bytes is what caused the problem. When a character actually contains multiple bytes, this happens.

UPDATE: According to official documents, adding MapToLongVarchar=0 to simba.snowflake.ini will map all string data types to SQL_LONGVARCHAR.

I tested this on the linux, and it solved the problem in the thread.

Though it's still unknown that:

  1. the performance loss. Given the document

Performance: There is no performance difference between using the full-length VARCHAR declaration VARCHAR(16777216) and a smaller length.

I'd assume there is no performance loss. But without a real performance test I couldn't be sure of this.

  1. what exactly is this SQL_LONGVARCHAR type?

I searched for a while and couldn't find it anywhere. I'd assume it to be TEXT type, but no evidence.

yuhenghuang avatar Jan 05 '23 01:01 yuhenghuang