
[MSSQL] encoding/decoding format change for reading data

llanka opened this issue • 7 comments

Hi,

After I get the data from cursor.fetchallnumpy(), I'm unable to write it to a file using 'latin-1' encoding; only 'utf-8' works here. When I used pyodbc, I had the option of setting the encoding/decoding format on the connection, like:

conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin-1')
conn.setencoding(encoding='latin-1')

I could not find anything similar to this in turbodbc.
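
For context, a self-contained sketch of that pyodbc pattern (the connection string below is a placeholder; setdecoding and setencoding are pyodbc's documented knobs for this):

import pyodbc

# placeholder connection string; any valid MSSQL connection string works
conn = pyodbc.connect('DSN=my_mssql_dsn')
# decode CHAR/VARCHAR columns coming from the server as latin-1
conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin-1')
# encode parameters sent to the server as latin-1
conn.setencoding(encoding='latin-1')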

Thanks

llanka (Oct 30 '18)

Hi there! With turbodbc and MSSQL, you may want to set the prefer_unicode option, which makes turbodbc use the UCS-2-based wide ODBC API and yields proper Unicode Python strings. If you want to write these to a file, give the encoding parameter of Python's open function a spin:

with open('test', encoding='utf-8', mode='w') as f:
    ...

(see Python documentation)
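
A minimal sketch of that suggestion with both pieces put together (the connection string and query are placeholders, not from this thread):

import turbodbc

# prefer_unicode makes turbodbc use the wide (UCS-2) ODBC API
options = turbodbc.make_options(prefer_unicode=True)
connection = turbodbc.connect(connection_string='DSN=my_mssql_dsn',
                              turbodbc_options=options)
cursor = connection.cursor()
cursor.execute('select name from aaa')

# fetched values arrive as ordinary Python str objects; the file
# encoding is chosen only at write time
with open('test', encoding='utf-8', mode='w') as f:
    for (name,) in cursor:
        f.write(name + '\n')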

MathMagique (Nov 07 '18)

Hi, thanks for the reply! I've already used the encoding parameter while writing to a file, and I've also set prefer_unicode, but the problem remains. My issue is that I can write the data using utf-8, but I need to write it using latin-1, which, as I mentioned above, is possible in pyodbc.

llanka (Nov 12 '18)

Could you provide the code you are using, please? The one with turbodbc in it?

MathMagique (Nov 12 '18)

Sure,

import pandas as pd
import csv
import numpy
import turbodbc

options = turbodbc.make_options(
    read_buffer_size=turbodbc.Rows(100),
    parameter_sets_to_buffer=5000,
    use_async_io=True,
    prefer_unicode=True,
    large_decimals_as_64_bit_types=True)

# conn_mssql_str is defined elsewhere in the original script
conn = turbodbc.connect(connection_string=conn_mssql_str, turbodbc_options=options)
cursor = conn.cursor()

query_str = 'select * from aaa'
cursor.execute(query_str)

# fetchallnumpy() returns an OrderedDict of column name -> numpy masked array
result = cursor.fetchallnumpy()
result_df = pd.DataFrame(result)

# ExtraFilesPath is defined elsewhere in the original script
result_df.to_csv(ExtraFilesPath + 'aaa.txt', sep='|', index=False,
                 quoting=csv.QUOTE_ALL, encoding='latin-1')

llanka (Nov 12 '18)

Sorry to bother you again; do you have any solution or update for me? Thanks

llanka (Nov 23 '18)

Hm, looks reasonable enough to me. Have you verified that the dataframe itself contains the correct data? Have you checked what happens if you leave out the encoding parameter? Are there any differences in the output file then? I am also not sure whether latin-1 is the proper spelling; the Python docs seem to favour latin_1. Just fishing for clues.
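
On the spelling question, a quick check with the standard library shows that both spellings are aliases for the same codec:

import codecs

# both spellings resolve to the same codec, so open() and to_csv()
# accept either one
print(codecs.lookup('latin-1').name, codecs.lookup('latin_1').name)
# both print iso8859-1, i.e. the two spellings are interchangeable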

MathMagique (Nov 26 '18)

If I omit the encoding, it defaults to utf-8 in Python 3. Actually, the above code fails at the to_csv step when the data contains special characters, because I've used latin-1. The issue is that the MSSQL collation for the text columns is 'SQL_Latin1_General_CP1_CI_AS', and by default Python 3 writes those special characters out as utf-8. pyodbc has the option of specifying latin-1 explicitly while connecting to the database, so that values are encoded/decoded as latin-1. But my requirement here is to be able to write the data using latin-1, not utf-8.
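
A hedged sketch of one way around the to_csv failure, reusing result_df and ExtraFilesPath from the snippet above: open the file yourself with an error handler and pass the file object to to_csv, so characters that fall outside latin-1 are replaced instead of raising UnicodeEncodeError:

import csv

# open the target file with an explicit error handler; '?' is substituted
# for any character that latin-1 cannot represent
with open(ExtraFilesPath + 'aaa.txt', mode='w', encoding='latin-1',
          errors='replace', newline='') as f:
    result_df.to_csv(f, sep='|', index=False, quoting=csv.QUOTE_ALL)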

llanka (Dec 03 '18)