[MSSQL] encoding/decoding format change for reading data
Hi,
After I get the data from cursor.fetchallnumpy(), I'm unable to write it to a file using 'latin-1' encoding; only 'utf-8' works here. When I used pyodbc, I had the option of setting the encoding/decoding format on the connection, like:
conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin-1')
conn.setencoding(encoding='latin-1')
I could not find anything similar to this in turbodbc.
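For reference, the full pyodbc setup I'm describing looks roughly like this (the connection string is just a placeholder):

import pyodbc

# Placeholder connection string; the real one points at the MSSQL server
conn = pyodbc.connect('DSN=my_mssql_dsn')
# Decode CHAR/VARCHAR data as latin-1 instead of the driver default
conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin-1')
# Encode outgoing text parameters as latin-1 as well
conn.setencoding(encoding='latin-1')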
Thanks
Hi there! With turbodbc and MSSQL, you may want to set the prefer_unicode option. This should generate UCS-2 encoded Python strings. If you want to write these to a file, give the encoding parameter of Python's open function a spin:
with open('test', encoding='utf-8', mode='w') as f:
    ...
(see Python documentation)
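Putting both suggestions together, a minimal sketch might look like this (connection string, query, and file name are placeholders):

import turbodbc

options = turbodbc.make_options(prefer_unicode=True)
conn = turbodbc.connect(connection_string='DSN=my_mssql_dsn', turbodbc_options=options)
cursor = conn.cursor()
cursor.execute('select some_text_column from some_table')
# Rows come back as lists; write each value with the chosen encoding
with open('test', encoding='utf-8', mode='w') as f:
    for (value,) in cursor:
        f.write(value + '\n')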
Hi, thanks for the reply! I've already used the encoding parameter while writing to a file, and I've also set the prefer_unicode option. My issue is that I can write the data using utf-8, but I need to write it using latin-1, which, as I mentioned above, is possible in pyodbc.
Could you provide the code you are using, please? The one with turbodbc in it?
Sure,
import pandas as pd
import csv
import numpy
import turbodbc

# Turbodbc options, with prefer_unicode enabled as suggested
options = turbodbc.make_options(
    read_buffer_size=turbodbc.Rows(100),
    parameter_sets_to_buffer=5000,
    use_async_io=True,
    prefer_unicode=True,
    large_decimals_as_64_bit_types=True)

conn = turbodbc.connect(connection_string=conn_mssql_str, turbodbc_options=options)
cursor = conn.cursor()

query_str = 'select * from aaa'
cursor.execute(query_str)
result = cursor.fetchallnumpy()

# Build a DataFrame from the NumPy result and write it out as latin-1
result_df = pd.DataFrame(result)
result_df.to_csv(ExtraFilesPath + 'aaa.txt', sep='|', index=False,
                 quoting=csv.QUOTE_ALL, encoding='latin-1')
Sorry to bother you again, but do you have any solution or update for me? Thanks
Hm, looks reasonable enough to me. Have you verified that the dataframe itself contains the correct data? Have you checked what happens if you leave out the encoding parameter? Are there any differences in the output file then? I am also not sure whether latin-1 is the proper spelling; the Python docs seem to favour latin_1. Just fishing for clues.
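In the same spirit, one quick check might be to look for values that cannot be encoded as latin-1 at all (a rough sketch, reusing the result_df from your snippet):

def not_latin1_encodable(value):
    # True only for strings that raise UnicodeEncodeError under latin-1
    if not isinstance(value, str):
        return False
    try:
        value.encode('latin-1')
        return False
    except UnicodeEncodeError:
        return True

for col in result_df.columns:
    if result_df[col].dtype == object:
        mask = result_df[col].map(not_latin1_encodable)
        if mask.any():
            print(col, result_df.loc[mask, col].head())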
If I omit the encoding, it defaults to utf-8 in Python 3. Actually, the above code fails at the to_csv step when there are special characters, because I've used latin-1. The issue is that the MSSQL collation for the text columns is 'SQL_Latin1_General_CP1_CI_AS', and by default Python 3 writes those special characters out as utf-8. pyodbc has the option of specifying latin-1 explicitly when connecting to the database, so values are encoded/decoded as latin-1. But my requirement here is to be able to write the data using latin-1, not utf-8.
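If changing the driver-side decoding is not an option, a possible workaround is to sanitize the strings before writing, replacing anything that has no latin-1 mapping (a sketch, not built-in turbodbc functionality; errors='replace' turns unmappable characters into '?'):

def to_latin1_safe(value):
    # Round-trip through latin-1, replacing unmappable characters with '?'
    if isinstance(value, str):
        return value.encode('latin-1', errors='replace').decode('latin-1')
    return value

for col in result_df.columns:
    if result_df[col].dtype == object:
        result_df[col] = result_df[col].map(to_latin1_safe)

result_df.to_csv(ExtraFilesPath + 'aaa.txt', sep='|', index=False,
                 quoting=csv.QUOTE_ALL, encoding='latin-1')

If your pandas version supports it, to_csv also accepts an errors argument, which achieves the same with less code.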