amazon-redshift-python-driver
Redshift Driver Truncates Important Information in Error Messages
Driver version
2.0.907
Redshift version
Redshift 1.0.38698
Client Operating System
Docker python:3.10.2 image
Python version
3.10.2
Table schema
test_column: INTEGER
Problem description
When attempting to copy a file from S3 into Redshift via awswrangler, a data type mismatch correctly throws an error. However, the error message is truncated, which makes the issue hard to debug in non-trivial applications.
Python Driver trace logs
redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000', 'M': 'Spectrum Scan Error', 'D': "
error: Spectrum Scan Error
code: 15007
context: File 'https://s3.region.amazonaws.com/bucket/bucket_directory/subdirectory/afile.snappy.parquet' has an incompatible Parquet schema for column 's3://bucket/bucket_director
query: 1234567
location: dory_util.cpp:1226
process: worker_thread [pid=12345]
Reproduction code
import pandas as pd
import awswrangler as wr

df = pd.DataFrame([[1.23]], columns=["test_column"])  # target schema is an integer, this is a float
wr.redshift.copy(
    df=df,
    path="s3://bucket/bucket_directory/subdirectory/",
    con=con,  # an open redshift_connector connection (required by awswrangler)
    table="test",
    schema="public",
)
Hey @WillAyd, thank you for reporting this issue. I agree, this behavior would make debugging harder (not to mention more frustrating).
redshift_connector's error handling should just output what is received from the Redshift server via the Postgres wire protocol (the protocol docs describe what the incoming data looks like).
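For readers unfamiliar with the wire format: the single-letter keys in the error dict are Postgres ErrorResponse field codes. A rough mapping of the fields that appear in the tracebacks in this thread:

PG_ERROR_FIELDS = {
    "S": "severity",          # e.g. ERROR
    "C": "SQLSTATE code",     # XX000 = internal_error
    "M": "primary message",   # e.g. Spectrum Scan Error
    "D": "detail",            # the multi-line server-side error text that gets truncated
    "F": "source file",       # e.g. ../src/sys/xen_execute.cpp
    "L": "source line",
    "R": "routine",           # e.g. pg_throw
}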
As for next steps, I will try to reproduce this locally to determine if there is an issue with redshift_connector. If not, my guess would be that Redshift server is truncating the error message. In that case, I will chat with the Redshift server team to see what our options are.
Hey @WillAyd,
I just noticed the error message you provided isn't complete -- there's an opening quote for the error dict's "D" key but no closing quote. Could you provide the whole error message, please? Not sure if just the end quote is missing, but I want to double check.
Hi @Brooke-white - that is how the error gets printed to the console
Hey @WillAyd, thanks for the additional information. If it's accessible to you, could you please run the statement directly with redshift_connector? The lack of a closing brace for the error dictionary makes me wonder if the error message is being truncated after it's raised in redshift_connector.
I can't share the full traceback as this is on a work machine not connected to my GitHub account, but some debugging shows this at the top of the stack trace when the error is thrown:
https://github.com/aws/amazon-redshift-python-driver/blob/6a2ef2581770c3cd8b2a6941e164c8d7af2d4c6d/redshift_connector/core.py#L1986
If I set a breakpoint there and inspect the error before it gets raised, it appears already truncated when wrapped in the ProgrammingError object.
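For anyone following along, here is a minimal sketch of the same inspection done from the calling side, reusing df and con from the repro snippet above and assuming (as the printed output suggests) that the error dict is the exception's first argument:

import awswrangler as wr
import redshift_connector

try:
    wr.redshift.copy(df=df, path="s3://bucket/bucket_directory/subdirectory/", con=con, table="test", schema="public")
except redshift_connector.error.ProgrammingError as exc:
    error_dict = exc.args[0]          # assumed: the {'S': ..., 'M': ..., 'D': ...} dict from the traceback
    print(repr(error_dict.get("D")))  # repr() makes truncation and missing quotes easy to spot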
Hey @WillAyd,
Thanks for your patience. I have been unable to reproduce this on my end. I've included my code snippet below, where I tried an S3 copy with a very long file name. The error output was not truncated. This makes me think something outside of redshift_connector could be truncating the error messages, since it's especially strange that the error dict is missing the closing brace.
FWIW, my test environment was PyCharm on macOS. I see you're running under Docker. Could Docker be truncating the error messages? If possible, I'd recommend running locally to see if the same issue shows up.
# conn is an open redshift_connector connection
with conn.cursor() as cursor:
    cursor.execute("create table catdemo(catid smallint, catgroup varchar(10), catname varchar(10), catdesc varchar(50));")
    cursor.execute("""copy catdemo
        from 's3://my-test-bucket-githubissue107/Pneumonoultramicroscopicsilicovolcanoconiosis_Pneumonoultramicroscopicsilicovolcanoconiosis_Pneumonoultramicroscopicsilicovolcanoconiosis_category_pipe.txt'
        iam_role 'arn:aws:iam::123456789012:role/myRedshiftRole'
        FORMAT AS PARQUET;""")
>>> redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000', 'M': 'Spectrum Scan Error', 'D': "\n -----------------------------------------------\n error: Spectrum Scan Error\n code: 15007\n context: File 'https://s3.us-east-2.amazonaws.com/my-test-bucket-githubissue107/Pneumonoultramicroscopicsilicovolcanoconiosis_Pneumonoultramicroscopicsilicovolcanoconiosis_Pneumonoultramicroscopicsilicovolcanoconiosis_category_pipe.txt' has an invalid version number.\n query: 5344549\n location: dory_util.cpp:1377\n process: worker_thread [pid=13737]\n -----------------------------------------------\n", 'F': '../src/sys/xen_execute.cpp', 'L': '10512', 'R': 'pg_throw'}
Docker versus running directly on the host didn't make a difference. I ran directly as you did as well and was still getting the issue.
I noticed that your error is slightly different though, as you get ...has an invalid version number. I'm wondering if the back half of the error message is what gets truncated, as mine would say something like has an incompatible Parquet schema for column 's3://my-test-bucket-githubissue107/Pneumonoultram\n. If you change any of the columns in the file you are loading to have the wrong data type for the target table, do you get the same issue?
good thinking -- let me give this a try and I'll get back to you
I believe I'm hitting the same type of error as you're seeing, though I'm not seeing the truncation of the error message. I've included repro steps below:
- download and upload a test Parquet file (I used this one)
- view the Parquet file to figure out the column names, e.g. with https://www.parquet-viewer.com/
- incorrectly map the column data types in the COPY statement
cursor.execute("create table test_tbl (registration_dttm datetime, id int, first_name varchar, last_name varchar, email varchar, gender varchar, ip_address varchar, cc varchar, country varchar, birthdate varchar, salary numeric(20,10), title varchar, comments numeric(20, 10));")
cursor.execute("""COPY test_tbl
FROM 's3://my-test-bucket-githubissue107/userdata1.parquet'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS PARQUET;""")
>>> redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000', 'M': 'Spectrum Scan Error', 'D': "\n -----------------------------------------------\n error: Spectrum Scan Error\n code: 15007\n context: File 'https://s3.us-east-2.amazonaws.com/my-test-bucket-githubissue107/userdata1.parquet' has an incompatible Parquet schema for column 's3://my-test-bucket-githubissue107/userdata1.parquet.salary'. Column type: DECIMAL(20,10), Parquet schema:\noptional double salary [i:\n query: 5349558\n location: dory_util.cpp:1377\n process: worker_thread [pid=11807]\n -----------------------------------------------\n", 'F': '../src/sys/xen_execute.cpp', 'L': '10512', 'R': 'pg_throw'}
I was still getting the same issue with that file. The difference I've noticed, however, is that my S3 paths are much longer, and that seems to affect the error message truncation in non-obvious ways. Can you try placing the file in these locations and loading from there?
s3://my-test-bucket-githubissue107/this_is_a_subfolder/_another_one/this_is_a_very_long_folder/userdata1.parquet
s3://my-test-bucket-githubissue107/this_is_a_subfolder/_another_one/this_folder_might_be_even_longer/userdata1.parquet
s3://my-test-bucket-githubissue107/this_is_a_subfolder/_another_one/thisfolderisreallylongwithnoboundaries/userdata1.parquet
Hey @WillAyd, I am finally able to somewhat reproduce this! I am seeing the truncation of the server error message, but the Python error message is intact (i.e. the whole error dict is there).
cursor.execute("create table test_tbl (registration_dttm datetime, id int, first_name varchar, last_name varchar, email varchar, gender varchar, ip_address varchar, cc varchar, country varchar, birthdate varchar, salary numeric(20,10), title varchar, comments numeric(20, 10));")
cursor.execute("""COPY test_tbl
FROM 's3://my-test-bucket-githubissue107/this_is_a_subfolder/_another_one/thisfolderisreallylongwithnoboundaries/now_its_even_longer_wow_impressive/userdata1.parquet'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS PARQUET;""")
redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': 'XX000', 'M': 'Spectrum Scan Error', 'D': "\n -----------------------------------------------\n error: Spectrum Scan Error\n code: 15007\n context: File 'https://s3.us-east-2.amazonaws.com/my-test-bucket-githubissue107/this_is_a_subfolder/_another_one/thisfolderisreallylongwithnoboundaries/now_its_even_longer_wow_impressive/userdata1.parquet' has an incompatible Parquet schema for column 's3://my-test-bucket\n query: 5367908\n location: dory_util.cpp:1377\n process: worker_thread [pid=27516]\n -----------------------------------------------\n", 'F': '../src/sys/xen_execute.cpp', 'L': '10512', 'R': 'pg_throw'}
As the truncated error message lies in the middle of the data received from the Redshift server, I will reach out to the Redshift server team to raise this issue.
Awesome! That's the main thing I'm after - I think everything else in Python is fine.
The issue has been opened with the Redshift server team; I'll keep this GitHub issue open in the meantime and provide updates here once available :)
For any readers who come across the same issue: I've noticed that while the Python traceback truncates this error message, you can still get the full error message if you query SVL_S3LOG. Might help as a workaround.
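A minimal sketch of that workaround, assuming an open redshift_connector connection and using the query id from the truncated error's query: field (the id below is the one from the example output above):

with conn.cursor() as cursor:
    # SVL_S3LOG holds the per-query Spectrum log messages, including the
    # untruncated scan error; filter on the query id from the error text
    cursor.execute(
        "SELECT message FROM svl_s3log WHERE query = %s ORDER BY eventtime",
        (5367908,),
    )
    for (message,) in cursor.fetchall():
        print(message)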
I am hitting the same error here, where the Python traceback truncates the actual error message and doesn't include all the relevant details. SVL_S3LOG is a fine workaround, but for my use case it is not the most intuitive way of delivering information.
Is there any update on this ticket?
Hi @justbldwn -- unfortunately I do not have any update from my end. As many engineers are on holiday this week, I will ping the team again next week for an update on this.
Haven't forgotten about this; I'm working to escalate the ticket I opened with the team for this issue. Ref: 6124
Hi folks, if you are impacted by this issue, I suggest opening a case with AWS customer support and referencing "Redshift-6124" to encourage prioritization of a fix.