ODBC.jl

Changes in encoding

Open FinPl opened this issue 5 years ago • 3 comments

Hello,

Is there a way to automatically decode data using cp1252? I am connecting to both Oracle and Access, and they are both sending cp1252-encoded strings. For now I am using the Convert() function in Oracle. Access, on the other hand, was working fine before I switched to the new refactored ODBC.jl. I know I can use StringEncodings.jl to decode the strings, but I was wondering if there is an option to tell the API which encoding to expect from the server.

Any help would be appreciated,

Best regards

FinPl, Aug 23 '20 22:08

Sorry for the slow response here; yeah, we've had a number of issues over the years w/ encodings, mostly bugs, so the current 1.0 version relies heavily on all strings being UTF-8. I'd have to dig in to see if there's a way we could label strings coming out as a certain encoding. Part of the issue is that String in Julia expects its bytes to be UTF-8, though it allows invalid UTF-8 and you can convert. Could you explain a little more what you're doing? Like, what does your query look like? I'll have to dig into where we do the string processing and see if there's a way to "hook" into that to bypass converting.
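
To make the point about String concrete, here is a minimal sketch using only Julia Base. The sample bytes are invented for illustration and do not come from the issue. It shows that cp1252 bytes can be wrapped in a String without error yet are not valid UTF-8, and that bytes outside the 0x80-0x9F range can be remapped by hand because cp1252 agrees with Latin-1 there:

```julia
# Hypothetical sample: "café" as a cp1252 source would send it
# (é is the single byte 0xE9 rather than the two-byte UTF-8 sequence).
raw = UInt8[0x63, 0x61, 0x66, 0xe9]

# String(...) takes ownership of the vector, so wrap a copy.
s = String(copy(raw))
println(isvalid(s))        # checks whether the bytes form valid UTF-8

# Outside 0x80-0x9F, cp1252 matches Latin-1: each byte is its own code point,
# so converting every byte to a Char yields a proper UTF-8 String.
fixed = String(map(Char, raw))
println(isvalid(fixed))
println(fixed == "café")
```

This shortcut does not cover the 0x80-0x9F block (for example the euro sign), which needs a real cp1252 table.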

quinnj, Oct 08 '20 22:10

Thanks for the reply!

The queries I am using are as simple as they can be; everything is returned as expected except for strings. I have been using pyodbc since then to retrieve data from the databases. It looks like the magic is happening here: pyodbc src

That's the best insight I can give right now as I am not too familiar with ODBC specifications or Julia's string handling mechanisms.

I am still looking for a way to get rid of Python dependencies in my Julia scripts.

FinPl, Nov 06 '20 15:11

Hello again,

I think there are multiple places in the code where the conversion might go wrong if the source encoding is somewhat similar to UTF-8.

The API uses transcode() for column names (the str function), and the DBInterface layer assumes the input is encoded in UTF-8 or is a general Julia String (the jlcast function). Finally, the cwstring function also relies on transcode() for inputs.

From what I gather, the Python pyodbc package mentioned above uses the iconv library to handle encoding specifics.

There might be a way to hook into the StringEncodings.jl package, which already provides iconv bindings. This would allow proper handling of specific local encodings before the strings are delivered to Julia as UTF-8.
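
A minimal sketch of what that decoding step could look like with StringEncodings.jl, assuming the raw column bytes are available as a Vector{UInt8}. The sample bytes are invented for illustration, and nothing here is an existing ODBC.jl hook:

```julia
using StringEncodings  # third-party package that wraps iconv

# Hypothetical raw bytes for "café €" in cp1252 (é = 0xE9, € = 0x80).
raw = UInt8[0x63, 0x61, 0x66, 0xe9, 0x20, 0x80]

# decode converts from the named source encoding to a regular UTF-8 Julia String.
text = decode(raw, "CP1252")
println(text)

# encode goes the other way, e.g. for strings that must be sent back as cp1252.
roundtrip = encode(text, "CP1252")
println(roundtrip == raw)
```

Where such a call would sit inside ODBC.jl (per column, per connection, or as a user-supplied option) is an open design question in this thread.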

I guess the transcoding step has a cost, but it might be an acceptable trade-off for those who want to work with strings and databases.

FinPl, Mar 26 '21 21:03