
How to deal with VARCHAR(max) columns in MSSQL

TheDataScientistNL opened this issue on Oct 05 '23 • 4 comments

Hi, I am using polars==0.19.7, which now includes ODBC support through arrow-odbc-py (arrow-odbc==1.2.8).

When running the code in the example below, arrow-odbc raises an error.

import polars as pl

USERNM = ''
PWD = ''
DBNAME = ''
HOST = ''
PORT = ''

CONN = f"Driver={{ODBC Driver 17 for SQL Server}};Server={HOST};Port={PORT};Database={DBNAME};Uid={USERNM};Pwd={PWD}"

df = pl.read_database(
    connection=CONN,
    query="SELECT varchar_max_col FROM [dbo].[tablname]",
)

with the error being:

arrow_odbc.error.Error: There is a problem with the SQL type of the column with name: varchar_max_col and index 0: ODBC reported a size of '0' for the column. This might indicate that the driver cannot specify a sensible upper bound for the column. E.g. for cases like VARCHAR(max). Try casting the column into a type with a sensible upper bound. The type of the column causing this error is Varchar { length: 0 }.

I can easily resolve this by editing the query to

df = pl.read_database(
    connection=CONN,
    query="SELECT CAST(varchar_max_col AS VARCHAR(100)) AS varchar_max_col FROM [dbo].[tablname]",
)

which resolves the issue (alternatively, one could change the column type in the database, but that is not something you always want to or can do).

However, as varchar(max) columns still occur frequently in databases, I was wondering whether arrow-odbc could support this natively? In other words, it would detect varchar(max) columns and adjust the query so these columns are returned without throwing an error.

I hope this is the right place to ask the question, because I am not sure if this is arrow-odbc related or ODBC driver related...

TheDataScientistNL, Oct 05 '23

Hello @TheDataScientistNL ,

the best way to deal with VARCHAR(max) is to set the max_text_size parameter. See the documentation here: https://arrow-odbc.readthedocs.io/en/latest/arrow_odbc.html#arrow_odbc.read_arrow_batches_from_odbc
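A minimal sketch of what that looks like when calling arrow-odbc directly, reusing the CONN string from the example above; the value 4096 is an arbitrary choice, and per the linked documentation values longer than the cap may be truncated:

from arrow_odbc import read_arrow_batches_from_odbc

reader = read_arrow_batches_from_odbc(
    query="SELECT varchar_max_col FROM [dbo].[tablname]",
    connection_string=CONN,
    # Buffer cap for text columns whose size the driver reports as 0,
    # e.g. VARCHAR(max); values longer than this may be truncated.
    max_text_size=4096,
)

for batch in reader:
    print(batch.num_rows)  # each batch is a pyarrow.RecordBatch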

You are not using read_arrow_batches_from_odbc directly but via polars, where I think this integration was added only yesterday. Please ask the maintainers of polars how to forward these parameters, or use arrow-odbc directly.
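For completeness, a hypothetical sketch of what forwarding could look like from the polars side, assuming a polars version whose read_database accepts execute_options and passes them through to arrow-odbc (not confirmed in this thread):

df = pl.read_database(
    connection=CONN,
    query="SELECT varchar_max_col FROM [dbo].[tablname]",
    # Assumption: these options are forwarded as keyword arguments to
    # arrow_odbc.read_arrow_batches_from_odbc.
    execute_options={"max_text_size": 4096},
)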

Best, Markus

pacman82, Oct 05 '23

> I hope this is the right place to ask the question, because I am not sure if this is arrow-odbc related or ODBC driver related...

Neither; it is ODBC standard related. It is an inherent limitation of the API. Avoid VARCHAR(max), TEXT, or similar unbounded types in schema declarations if you want fast bulk fetches. I take back what I said earlier: the best way to deal with this is to fix the schema, if possible.
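For illustration, a hedged sketch of such a schema fix; pyodbc and the new bound of 4000 are assumptions for the example, not part of this thread, and the ALTER only succeeds if all existing values fit the new length:

import pyodbc

conn = pyodbc.connect(CONN)
cursor = conn.cursor()
# Give the column an explicit upper bound so the ODBC driver can
# report a real size instead of 0.
cursor.execute(
    "ALTER TABLE [dbo].[tablname] ALTER COLUMN varchar_max_col VARCHAR(4000)"
)
conn.commit()
conn.close()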

pacman82, Oct 05 '23

And I was so hoping to avoid a mystery-meat **kwargs pass-through for all the different connection flavours we now support 🤣 I'll think about the cleanest thing we can expose.

alexander-beedie, Oct 06 '23

Just typing on my phone right now, so I will keep it short. I can sympathise with that. I wouldn't recommend a passthrough at all.

pacman82, Oct 06 '23