aws-sdk-pandas

We can not infer the data type from an entire null object column

Open misteliy opened this issue 2 years ago • 3 comments

Describe the bug

If a column is entirely null, there should be a fallback data type (varchar).


I'm using:

    import awswrangler as wr

    # path, con and file_name are defined elsewhere in the calling code
    wr.redshift.copy_from_files(
        path=path,
        con=con,
        table=file_name.replace(".parquet", ""),
        schema="staging",
        parquet_infer_sampling=1,
        varchar_lengths_default=65535,
    )

How to Reproduce

*P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.*

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.10.12

AWS SDK for pandas version

3.4.2

Additional context

No response

misteliy • Dec 08 '23 13:12

Hi @misteliy, it looks like there is a column in the data that contains only nulls, so we fail to recognise the type of that column.

As a hotfix, you can identify the column that has the issue and provide the list of valid columns via the column_names parameter. I will check if there is anything else we can do to fix this on our side.
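A minimal sketch of that workaround, for illustration only: it assumes the data is available as a mapping of column name to values (for a pandas DataFrame, something like `df.to_dict(orient="list")`), and the helper names `_is_null` and `split_null_columns` are hypothetical, not part of the library. The resulting list is what would be passed as column_names.

```python
import math
from typing import Any, Mapping, Sequence


def _is_null(value: Any) -> bool:
    """Treat None and float NaN as null, similar to how pandas reports missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_null_columns(
    columns: Mapping[str, Sequence[Any]],
) -> tuple[list[str], list[str]]:
    """Return (valid, all_null) column names, preserving the input order.

    A column with no rows at all is also reported as all-null,
    because there is nothing to infer a type from.
    """
    valid: list[str] = []
    all_null: list[str] = []
    for name, values in columns.items():
        if all(_is_null(value) for value in values):
            all_null.append(name)
        else:
            valid.append(name)
    return valid, all_null


# Hypothetical sample data: "notes" and "score" contain only nulls.
sample = {
    "id": [1, 2, 3],
    "name": ["a", None, "c"],
    "notes": [None, None, None],
    "score": [float("nan"), float("nan"), float("nan")],
}
valid_columns, null_columns = split_null_columns(sample)

# The list of valid columns would then go into the call from the report, e.g.:
# wr.redshift.copy_from_files(path=path, con=con, table=table,
#                             schema="staging", column_names=valid_columns)
```

With a DataFrame at hand, `df.columns[df.isna().all()]` gives the all-null columns more directly; the plain-Python version above only avoids the extra dependency.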

kukushking • Dec 21 '23 17:12

Thanks 🙏 Yes, that’s exactly what I have done 😊 I just wanted to raise this because it could maybe be handled more gracefully.

misteliy • Dec 21 '23 18:12

Hello. I received the same error message in a different context and found this ticket while investigating the problem.

Just wanted to share my two cents: I would prefer not to have a fallback. As I understand it, a fallback would just delay a potential error until later. If the fallback is a string and the actual type is an int that arrives in a future file, we will just get a type mismatch at that point.

What puzzles me is that it needs to derive the type at all. At least in the case of Parquet, where the files contain metadata, why not simply take the type from the metadata? Does the current approach relate to partitioning, since there is no metadata for those?

t15k • Feb 12 '24 12:02

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.

github-actions[bot] • Apr 12 '24 15:04