BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend
Pandas version checks
-
[X] I have checked that this issue has not already been reported.
-
[X] I have confirmed this bug exists on the latest version of pandas.
-
[x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import sqlite3
import pandas as pd
db = sqlite3.connect("file::memory:?cache=shared")
query = f"""
select cast(x'0123456789abcdef0123456789abcdef' as blob) a
"""
df = pd.read_sql(query, db, dtype_backend='pyarrow')
display(df)
Issue Description
This fails with
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 4: invalid start byte
The repro is for sqllite, but the issue is the same with sqlalchemy and pyodbc.
Also read_sql_table fails with the same error.
Expected Behavior
Should succeed and return a dataframe with a binary column. It works with the default backend.
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140 python : 3.10.12.final.0 python-bits : 64 OS : Linux OS-release : 5.15.158.2-1.cm2 Version : #1 SMP Sun Jun 9 18:33:38 UTC 2024 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8
pandas : 2.2.2 numpy : 1.24.3 pytz : 2023.3.post1 dateutil : 2.8.2 setuptools : 68.2.2 pip : 23.1.2 Cython : 3.0.4 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : 4.9.3 html5lib : 1.1 pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.14.0 pandas_datareader : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : 4.12.2 bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : 2023.10.0 gcsfs : None matplotlib : 3.7.2 numba : 0.57.1 numexpr : None odfpy : None openpyxl : 3.1.2 pandas_gbq : None pyarrow : 12.0.1 pyreadstat : None python-calamine : None pyxlsb : None s3fs : None scipy : 1.10.1 sqlalchemy : 2.0.22 tables : None tabulate : 0.9.0 xarray : None xlrd : None zstandard : 0.21.0 tzdata : 2023.3 qtpy : None pyqt5 : None
Thanks for the report. This was likely introduced with #50048 when the nullable keyword (now renamed to dtype_backend) was added to sql functions.
There is a comment https://github.com/pandas-dev/pandas/pull/50048#discussion_r1184376583 which is where it assumes it's string data. cc @phofl
Full Traceback
Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0x89 in position 4: invalid start byte
File "/home/asishm/pandas-asishm/pandas/core/arrays/string_.py", line 412, in _from_sequence
result = lib.ensure_string_array(scalars, na_value=libmissing.NA, copy=copy)
File "/home/asishm/pandas-asishm/pandas/core/internals/construction.py", line 972, in convert
arr = arr_cls._from_sequence(arr, dtype=new_dtype)
File "/home/asishm/pandas-asishm/pandas/core/internals/construction.py", line 993, in <listcomp>
arrays = [convert(arr) for arr in content]
File "/home/asishm/pandas-asishm/pandas/core/internals/construction.py", line 993, in convert_object_array
arrays = [convert(arr) for arr in content]
File "/home/asishm/pandas-asishm/pandas/io/sql.py", line 161, in _convert_arrays_to_dataframe
arrays = convert_object_array(
File "/home/asishm/pandas-asishm/pandas/io/sql.py", line 198, in _wrap_result
frame = _convert_arrays_to_dataframe(data, columns, coerce_float, dtype_backend)
File "/home/asishm/pandas-asishm/pandas/io/sql.py", line 2738, in read_query
frame = _wrap_result(
File "/home/asishm/pandas-asishm/pandas/io/sql.py", line 691, in read_sql
return pandas_sql.read_query(
File "/home/asishm/pd-issues/59242.py", line 12, in <module>
df = pd.read_sql(query, db, dtype_backend='pyarrow')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 4: invalid start byte
take
take