pyodbc icon indicating copy to clipboard operation
pyodbc copied to clipboard

Decimal results incorrect depending on locale

Open gv-collibris opened this issue 5 years ago • 19 comments
trafficstars

Environment

  • Python: 3.6.3
  • pyodbc: 4.0.30
  • OS: WSL, Windows 10, Debian 10
  • DB: SQL Server
  • driver: ODBC Driver 17 for SQL Server

Issue

Numbers (e.g. 3656.880000) from the database are turned to different numbers in Python, depending on the locale, e.g. 3656.88 for an English locale, and 3656880000 for a French one.

Expected behavior

I would expect to always obtain the correct number in the Python code.

gv-collibris avatar Apr 30 '20 09:04 gv-collibris

Please provide a minimal reproducible example.

gordthompson avatar Apr 30 '20 16:04 gordthompson

Hi, Thanks for your reply. Here is a minimal reproducible example. Cheers

gv-collibris avatar May 04 '20 09:05 gv-collibris

Okay, that's weird. I was able to reproduce the issue using your example, but it works fine when I use a simple stand-alone script:

import locale

import pyodbc

if input("English or French? (e/f): ").startswith("f"):
    locale.setlocale(locale.LC_ALL, 'fr_FR.UTF-8')
print(locale.getlocale())

connection_string = (
    "DRIVER=ODBC Driver 17 for SQL Server;"
    "SERVER=192.168.0.179,49242;"
    "DATABASE=myDb;"
    "UID=sa;PWD=_whatever_"
)
cnxn = pyodbc.connect(connection_string)
crsr = cnxn.cursor()
print(crsr.execute("SELECT 1890 * 1.0 / 100").fetchone())

""" console output
English or French? (e/f): e
('en_CA', 'UTF-8')
(Decimal('18.900000'), )

English or French? (e/f): f
('fr_FR', 'UTF-8')
(Decimal('18.900000'), )
"""

gordthompson avatar May 04 '20 12:05 gordthompson

yeah, I've seen that too I struggled a bit to reproduced the issue I have no idea how the use of a package can be different from a script...

gv-collibris avatar May 04 '20 12:05 gv-collibris

Even more mysterious: With your MRE and my script the ODBC driver returns the same string value ...

[ODBC][6457][1588599816.070747][SQLGetData.c][237]
        Entry:
            Statement = 0x2467540
            Column Number = 1
            Target Type = -8 SQL_WCHAR
            Buffer Length = 4096
            Target Value = 0x24593a0
            StrLen Or Ind = 0x7ffe67f0bd10
[ODBC][6457][1588599816.070863][SQLGetData.c][545]
        Exit:[SQL_SUCCESS]                
            Buffer = [18.900000](unicode)                
            Strlen Or Ind = 0x7ffe67f0bd10 -> 18

... but when it pops out of fetchall your MRE has [(Decimal('18900000'), )] while my script has [(Decimal('18.900000'), )], and that holds true whether we use

SELECT 1890 * 1.0 / 100

or

SELECT CAST(1890 * 1.0 / 100 AS decimal(10,6))

However, if I change your MRE to use

SELECT CAST(1890 * 1.0 / 100 AS float)

then the ODBC driver returns ...

[ODBC][6501][1588600241.966768][SQLGetData.c][237]
        Entry:
            Statement = 0x15a7d90
            Column Number = 1
            Target Type = 8 SQL_DOUBLE
            Buffer Length = 8
            Target Value = 0x7ffc7e61ca58
            StrLen Or Ind = 0x7ffc7e61ca60
[ODBC][6501][1588600241.966842][SQLGetData.c][545]
        Exit:[SQL_SUCCESS]                
            Buffer = [18,9]                
            Strlen Or Ind = 0x7ffc7e61ca60 -> 8

... and the (correct) resulting value is [(18.9, )]

gordthompson avatar May 04 '20 13:05 gordthompson

My (uneducated) guess is that the creation of a Decimal uses somehow the Python locale. As we receive 18.900000 from the DB, and as , is the decimal delimiter in French, the dot is ignored altogether. The use of a float do not present this issue, as it's a different data structure, which do not use the locale (I guess).

Initially, I thought I made a mistake in my own code when formatting the numbers, that's where this guess comes from...

gv-collibris avatar May 04 '20 14:05 gv-collibris

Just for clarity, because it might be significant here. In T-SQL the "1.0" literal is not a float, somewhat counter-intuitively, it's a decimal, so multiplying by 1.0 does not automatically generate a float, it typically generates a decimal:

SELECT
SQL_VARIANT_PROPERTY(1.0,'BaseType') AS 'BaseType',
SQL_VARIANT_PROPERTY(1.0,'Precision') AS 'Precision',
SQL_VARIANT_PROPERTY(1.0,'Scale') AS 'Scale'
UNION
SELECT
SQL_VARIANT_PROPERTY(1890 * 1.0,'BaseType') AS 'BaseType',
SQL_VARIANT_PROPERTY(1890 * 1.0,'Precision') AS 'Precision',
SQL_VARIANT_PROPERTY(1890 * 1.0,'Scale') AS 'Scale';

results in:

BaseType    Precision    Scale
numeric        2            1
numeric        7            1

Apologies if you knew this already, but I know it was kind of a surprise to me.

keitherskine avatar May 04 '20 14:05 keitherskine

@gv-collibris - You're probably right; we're just trying to give @v-makouz as much information as we can.

I also tried adding ;Regional=Yes to the end of the connection string in the MRE but that resulted in

pyodbc.Error: ('07006', '[07006] [Microsoft][ODBC Driver 17 for SQL Server]Restricted data type attribute violation (0) (SQLGetData)')

gordthompson avatar May 04 '20 15:05 gordthompson

I think the difference in behavior between Gord's script and the package might be explained by exactly when pyodbc is being imported. In pyodbc, it looks like the numeric locale information is read from the environment when pyodbc is initialized (i.e. imported): https://github.com/mkleehammer/pyodbc/blob/master/src/pyodbcmodule.cpp#L1226 and appears to be fixed after that.

Hence, if you add an import pyodbc to the very beginning of test.py, i.e. before setting the locale, which then emulates Gord's script, then you appear to get a different result. Not sure if this solves anything, but I thought I'd mention it.

keitherskine avatar May 04 '20 15:05 keitherskine

One other thing, if you pyodbc.setDecimalSeparator('.') at the start of test_function() (i.e. set the decimal point as English-style, despite the French locale), it appears you get the right result (18.9).

keitherskine avatar May 04 '20 15:05 keitherskine

@gv-collibris Just for the record, what is the database instance language you are using, and the database collation? If you run the following:

USE <your database>;
SELECT @@language AS instance_lang;
SELECT CONVERT(varchar(256),SERVERPROPERTY('collation')) AS database_collation;

...what do you get?

keitherskine avatar May 04 '20 16:05 keitherskine

Instance language: us_english Database collation: SQL_Latin1_General_CP1_CI_AS

gv-collibris avatar May 04 '20 16:05 gv-collibris

Thanks, @gv-collibris . I thought perhaps you might be using a French-collated database instance but that does not appear to be the case.

Currently, it appears pyodbc reads the decimal point character from the current locale when it is imported (and only once). Personally, I'm not sure whether it should be doing that at all because the decimal point character in the result set is probably added by the SQL Server database engine rather than any C libraries in the server itself (although I'm speculating there).

In the scenario you describe, the workarounds appears to be either:

  • import pyodbc before setting the locale, causing pyodbc to use the default locale (i.e. US English).
  • use pyodbc.setDecimalSeparator('.') to explicitly set the decimal point character to a period before running SQL queries.

I hope that helps.

keitherskine avatar May 04 '20 16:05 keitherskine

I used an other workaround: I round and cast to integer, then divide by 100 in the Python code. It's not particularly an issue, but it is a weird surprise I wanted to avoid to other people ;)

gv-collibris avatar May 05 '20 13:05 gv-collibris

Glad to hear you're not being held up by this. This curious locale behavior was not something I was aware of, so it's good you raised this. Many thanks.

keitherskine avatar May 05 '20 13:05 keitherskine

The reason for this is in getdata.cpp GetDataDecimal function. There it loops through all the characters trying to detect the decimal mark, but it uses the one based on locale for comparison, while ODBC driver always uses "."

One solution I can think of is to replace if (*pch == chDecimal) with if (*pch == chDecimal || *pch == '.')

It looks a little hacky, but it should work for all drivers, whether they use the locale based one or default one.

What do you guys think?

v-makouz avatar May 05 '20 22:05 v-makouz

I can kind of understand why the decimal separator from the Python locale was chosen to be used by pyodbc when parsing decimal values, but it still seems something of an odd choice. First off, I'm making a big assumption that decimal values are generated on the database server as essentially strings, and are sent over the wire more-or-less untouched. So a German-language database might generate a decimal value like "12.345.678,99". This is my big assumption and I'm happy to be corrected on that.

On SQL Server, changing the language of a database (e.g. SET LANGUAGE German) doesn't appear to make any difference to the decimal values in the ODBC trace, they are for example "12345678.99" regardless of the language. Perhaps German-style decimals can be generated by setting the language on the server itself, or by using a different RDBMS like Oracle.

My concern about using the Python locale is that there's no guarantee the app server (running Python) is going to have the same locale as the database server. The database server might be "German" but the app server could be "British". The locales might match, but it's not guaranteed. Hence, it seems something of a stretch to use the app server's locale to figure out what the decimal point character is going to be. By the way, @v-makouz , that's also an argument not to use '.' all the time as per your suggestion. In my German example above, we wouldn't want to use the period character when parsing "12.345.678,99". The trouble is, it's difficult to figure out the decimal point character any other way, hence I can understand why the Python locale is being used.

I appreciate I'm not coming up with any answers here. The big question seems to be whether decimal values can be received in a whole variety of formats (including monetary values), as per the comment on the GetDataDecimal function. If so, then this is not an easy problem to solve. If decimal values can truly be "12,345.678000", "12.345.678,99", "$876.55", or "-345678.77€", then this is very tricky. For example, how would you distinguish "1,234"? Is that one thousand two hundred and thirty four, or one point two three four? Short of making a test query like SELECT CAST(1.5 AS DECIMAL(2,1)) and grabbing the middle character, or perhaps parsing from the right and assuming the first period or comma is the decimal separator, I don't know how the decimal point character can be determined in a watertight manner.

Ref: https://docs.microsoft.com/en-us/globalization/locale/number-formatting

keitherskine avatar May 05 '20 23:05 keitherskine

The comments indicated that the "proper" way to read using a binary format weren't working for all database. Maybe the 5.0 version should default to binary but have a function to go back to strings and configuring the separator.

I'd love to be able to write a script that would print out the configuration you should set. The difficulty is you'd have to be tricky getting the information into the DB to read back. That is, if you don't know whether to use "." or "," you have to not use any in your inputs. That is, instead of "select 123.45" you need something like "select (cast 12345 as decimal(19,2)) / 100".

This might be a good idea even if just provided as a script alongside pyodbc.

mkleehammer avatar Apr 11 '23 15:04 mkleehammer