odbc icon indicating copy to clipboard operation
odbc copied to clipboard

nvarchar character encoding in nettezza db

Open andreaspano opened this issue 8 years ago • 20 comments

Issue Description and Expected Result

When quering nettezza on nvarchar fields, returns strange characters instead of plain UTF-8. The issue occurs only on nvarchar. varchar, date_time and numeric work correctly

Database

IBM Netezza

Reproducible Example

library(odbc)
  con <- dbConnect(odbc::odbc(),
                   driver = "NetezzaSQL",
                   database = "NYCFLIGHTS",
                   uid = "user",
                   pwd = "user",
                   server = "192.168.69.xxx",
                   port = 5480,
                   encoding = 'UTF-8')

dbGetQuery(con, 'select TAILNUM, SEATS from planes limit 3')

  TAILNUM SEATS
1  ㅎ㔱ㄵ    55
2  ㅎㄲ㈴    55
3  ㅎㄲ㌶    55

devtools::session_info()
setting  value                       
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.383)           
 language en                          
 collate  it_IT.UTF-8                 
 tz       Europe/Rome                 
 date     2017-10-25    

andreaspano avatar Oct 25 '17 13:10 andreaspano

The encoding parameter needs to be set to the database encoding. It looks like you can query it using nzsql -c "show nz_encoding".

jimhester avatar Oct 25 '17 14:10 jimhester

Already done ... The encoding is UTF-8. I am not using chinese character but this is what I see as output

andreaspano avatar Oct 25 '17 14:10 andreaspano

What are the values returned for this query if you use nzsql directly?

This is likely due to a mismatch in the unicode width between the driver and the database. You may have to set the UnicodeTranslationOption (https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.datacon.doc/c_datacon_odbc_driver_config_unix_linux.html).

See also https://stackoverflow.com/questions/21536059/python-pyodbc-connections-to-ibm-netezza-erroring, which are some answers for a similar problem with pyodbc.

jimhester avatar Oct 25 '17 15:10 jimhester

Thanks to Jim for his support

nzsql works fine

My odbc.ini is set up as described in the first link

I am aware of the cast work around and it works fine, But, when moving forward to testing the Netezza-dplyr interface this is not enough :-(

andreaspano avatar Oct 25 '17 16:10 andreaspano

Did you try using UnicodeTranslationOption = utf16?

jimhester avatar Oct 25 '17 16:10 jimhester

Few months back we encountered the same problem. I opened an RStudio support case for this (21266) but we didn't find a solution. For now we are also casting the values. Which does not scale to many users and many queries... Back then I also noticed the 8 byte vs 16 byte problem but didn't manage to figure out where it went wrong. Note that we did not have the problem with RStudio on Windows. Only with RStudio Server Pro on Linux. I remember trying different values for UnicodeTranslationOption without success. Will give that another try.

bquinart avatar Oct 25 '17 20:10 bquinart

Changing the value for UnicodeTranslationOption does not seem to have any effect. Note that the RODBCDBI package has the correct characters... (for same DSN and query).

bquinart avatar Oct 25 '17 21:10 bquinart

It looks like UnicodeTranslationOption only relates to 'normal' varchar fields and not nvarchar. The only thing I can 'achieve' is that the same 'error in conversion' happens with the regular varchar fields when the driver has "UnicodeTranslationOption=utf16" and the R odbc connection encoding = 'UTF-16'. For all other combination varchar fields are either correct or the decoding of unicode characters fails. The nvarchar output remains constant.

bquinart avatar Oct 25 '17 21:10 bquinart

In conclusion, it seems that nvarchar, with the netezza driver and the odbc package, are not shown correctly. At the moment no one seems to have a 100% solution. Thanks to all for support

andreaspano avatar Oct 29 '17 20:10 andreaspano

We just upgraded our server to RStudio 1.1. Not sure if it is related, but we have the same effect of special characters for the database, schema and table names in the IDE Connections pane for Netezza connections. I guess the metadata queries that fill that screen also return nvarchar fields.

bquinart avatar Nov 03 '17 17:11 bquinart

Another datapoint: Netzza connectivity works perfectly with RODBC:

library(DBI)
library(RODBCDBI)

con <- dbConnect(
  RODBCDBI::ODBC(),
  dsn = NA,
  connection = sprintf(
    "DRIVER={NetezzaSQL};SERVER=%s;DATABASE=%s;ReadOnly=true;UID=%s;PWD=%s",
    "SERVER",
    "DATABASE",
    "USER",
    "PASS"
  ),
  case = 'nochange'
)

randombk avatar Nov 11 '17 01:11 randombk

I see this issue got a label reprex. Does that mean the reprex available is good? Or that it is not reproducible? I suppose it is difficult to work on this if you do not have a Netezza database available. If that is the problem I can look into the dev environment that IBM offers. See here.

bquinart avatar Apr 13 '18 10:04 bquinart

Unfortunately I have no C++ or R package development experience and can not investigate this problem. However, I think I can show with below examples that the problem is that at some point, text that is UTF-8 encoded gets decoded as UTF-16. You can force the same error for the varchar fields. Those fields are decoded according the the parameter that gets passed in the connection string.

con <- dbConnect(odbc::odbc(), dsn= 'Netezza', encoding = 'UTF-8')
dbGetQuery(
  con,
  "SELECT CAST('Hello!' AS VARCHAR(30)) AS SIMPLEFIELD, CAST('Hello!' AS NVARCHAR(30)) AS UNICODEFIELD"
)
  SIMPLEFIELD UNICODEFIELD
1      Hello!        效汬Ⅿ

When we change the encoding parameter to UTF-16, the first field that was correctly decoded earlier now has the same error as the second field. Note that the 6 bytes from the string gets decoded as 3 characters (which makes sense for UTF-16).

con <- dbConnect(odbc::odbc(), dsn= 'Netezza', encoding = 'UTF-16')
dbGetQuery(
  con,
  "SELECT CAST('Hello!' AS VARCHAR(30)) AS SIMPLEFIELD, CAST('Hello!' AS NVARCHAR(30)) AS UNICODEFIELD"
)
  SIMPLEFIELD UNICODEFIELD
1       效汬Ⅿ        效汬Ⅿ

Note that this is with the IBM ODBC drivers. I believe there are also Simba Netezza ODBC drivers. I wonder if the same error happens there. Unfortunately I have no access to those drivers myself.

bquinart avatar Apr 13 '18 12:04 bquinart

In this case it means I need to be able to reproduce it myself, but I cannot because I do not currently have access to a netezza db

jimhester avatar Apr 13 '18 14:04 jimhester

We had this same issue, wanted to share some insights on what we found. If one looks at the bytes, it is an endian problem (UTF-16LE vs UTF-16BE), i.e. the order of the bytes. We were able to duplicate the issue from R in Python. Python has an encode('utf-16le') function that can transpose the bytes and then it was readable data when printed. I'm not sure if it is the driver or odbc package that needs additional modification, but something needs to be able to take an option of UTF-16LE vs UTF-16BE, or something for endianness that would also work with UTF-32, etc.

stewartthomasj avatar Apr 23 '18 21:04 stewartthomasj

The ODBC Specification seems to be that all SQLWCHAR data should be little endian

SQLWCHAR data must be UTF-16LE (Little Endian).

So this seems like a driver issue to me, I am not sure if there is a Netezza driver configuration setting that can change this or not.

You should be able to convert the data after retrieval by using iconv(x, "UTF-16BE", "UTF-8") or similar.

jimhester avatar Jun 08 '18 15:06 jimhester

Thanks for the update. I have a support ticket open with IBM. I wasn't able to get the expected string back with iconv however. Somehow the RODBC package does not have this problem with the same driver.

bquinart avatar Jun 15 '18 12:06 bquinart

Can you try setting the locale as described in the following URL to see if that help? https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.dbu.doc/c_dbuser_specify_encoding.html

jkarioki avatar Aug 24 '18 19:08 jkarioki

No effect. I believe that relates to the nzsql tool and not the odbc connection. As noted above, this only happens with the linux driver - not on windows (against same Netezza appliance). It was noted above that it is likely related to the UTF-16 endianness.

bquinart avatar Aug 24 '18 20:08 bquinart

I had another look at this. I do not believe it is an endian problem. I can retrieve the original text with iconv(field, 'UTF-8','UTF-16LE'). Refer to my comment on 13/04. The data seems to be supplied as UTF-8 and is interpreted at some point as UTF-16LE. I understand the last part would be expected for type SQL_WVARCHAR. Not sure how I could confirm type NVARCHAR is indeed mapped to SQL_WVARCHAR. I wonder if this is working in RODBC, because they somehow still allow to decode SQL_WVARCHAR in other encodings like UTF-8.

bquinart avatar Aug 24 '18 22:08 bquinart