snowflake-connector-python icon indicating copy to clipboard operation
snowflake-connector-python copied to clipboard

SNOW-823151: First result batch has int64 arrow type for NUMBER column with real values

Open tekumara opened this issue 2 years ago • 12 comments

When fetching multiple batches via get_result_batches , the first batch is empty and has the incorrect arrow data type for NUMBER columns containing real values, eg:

import snowflake.connector
import pyarrow

conn = snowflake.connector.connect(... )
cur = conn.cursor()
cur.execute("SELECT cast(1.5 as NUMBER(32,4)) as AMOUNT;")
batches = cur.get_result_batches()
batch = batches[0]
# the first batch is normally empty, which triggers _create_empty_table
# in this MRE we trigger it explicitly to demonstrate the error 
table = batch._create_empty_table()
print(table.schema)

assert table.schema.types[0] == pyarrow.float64()

Will fail with an assertion error and output:

AMOUNT: int64

snowflake-connector-python==3.0.3

tekumara avatar May 19 '23 06:05 tekumara

@tekumara did this happen with previous versions?

sfc-gh-achandrasekaran avatar May 19 '23 16:05 sfc-gh-achandrasekaran

I'm not entirely sure.

tekumara avatar May 19 '23 22:05 tekumara

hey @tekumara , you are just one step close to get the result you want. get_result_batches returns an ArrowResultChunk object -- it's a data structure storing data, not the data itself. You can either call the create_iter to create an iterator to iterate the data within the chunk or call to_arrow, to_pandas to convert to pyarrow Table or pandas DataFrame if you have installed the pandas dependency.

please let me know if you have any other questions, I'm more than happy to answer.

sfc-gh-aling avatar Jun 15 '23 05:06 sfc-gh-aling

I still have the same problem using those methods. The MRE above uses _create_empty_table() to make it easier to reproduce.

tekumara avatar Jun 15 '23 12:06 tekumara

can you try this?

conn = snowflake.connector.connect(**CONNECTION_PARAMETERS)
cur = conn.cursor()
cur.execute("SELECT cast(1.5 as NUMBER(32,4)) as AMOUNT;")

batches = cur.get_result_batches()
batch = batches[0]
table = batch.to_arrow()
assert table.schema.types[0] == pyarrow.float64()

sfc-gh-aling avatar Jun 15 '23 16:06 sfc-gh-aling

That example works but not in the general case ie: when I fetch and use batches[0].to_arrow() from one of our real tables containing a NUMBER(32,4) the first batch is empty. This triggers the _create_empty_table() code path and the fault demonstrated in the contrived MRE above.

tekumara avatar Jun 15 '23 22:06 tekumara

@tekumara Did you ever find out why the first batch always returns empty?

ShahNewazKhan avatar Mar 23 '24 01:03 ShahNewazKhan

No I didn't, are you seeing this too?

tekumara avatar Mar 23 '24 04:03 tekumara

No I didn't, are you seeing this too?

Yes I am seeing the same thing, when the connector returns a list of ResultBatch objects, I have found that if it is a large list and rowcounts exceed ~ 2k then the first ResultBatch is empty.

To work around this, when distributing ResultBatch objects across multiple workers I simply check if the ResultBatch is empty and don't process it in the worker.

ShahNewazKhan avatar Mar 23 '24 16:03 ShahNewazKhan

@sfc-gh-aling could you reopen this issue please as it’s still happening? 🙏

tekumara avatar Mar 23 '24 22:03 tekumara

thank you all for your feedbacks on this one! i reopened this issue and we'll take a look

sfc-gh-dszmolka avatar Mar 24 '24 09:03 sfc-gh-dszmolka