RedshiftDataOperator fails when `return_sql_result` is true, and SQL statements are provided
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
Affects all current versions of apache-airflow-providers-amazon, including 8.24.0
Apache Airflow version
N/A - all versions
Operating System
NA
Deployment
Other
Deployment details
No response
What happened
There is a bug in RedshiftDataOperator if multiple sql queries are passed, and return_sql_result is set to true, then you will get the following error:
An error occurred (ValidationException) when calling the GetStatementResult operation: BatchExecuteStatement result can only be retrieved with sub-statement id.
What you think should happen instead
No response
How to reproduce
Run a RedshiftDataOperator like this:
from airflow.providers.amazon.aws.operators.redshift_data import RedshiftDataOperator
run_query = RedshiftDataOperator(
task_id="run_query",
aws_conn_id="MY_AWS_CONNECTION",
#
# Redshift parameters
cluster_identifier="MY_REDSHIFT_CLUSTER",
region="MY_REGION",
database="MY_DB",
sql=[
"SELECT 1;",
"SELECT 2;",
"SELECT 3;",
],
#
# Redshift Data API parameters
return_sql_result=True,
statement_name="SOME_NAME",
secret_arn=(
"arn:aws:secretsmanager"
":MY_REGION:MY_ACCOUNT"
":secret:MY_DATA_API_CREDENTIAL_SECRET"
),
#
# Trigger parameters
wait_for_completion=True,
poll_interval=10,
)
Anything else
To fix this we will need to record the length of sql list, and loop through get_statement_result for each sub-statement id because you can only get one statement result at a time.
For example, if there are three, you would get the results of the following queries, starting from :1:
d9b6c0c9-0747-4bf4-b142-e8883122f766:1d9b6c0c9-0747-4bf4-b142-e8883122f766:2d9b6c0c9-0747-4bf4-b142-e8883122f766:3
There are TWO difference places where we call get_statement_result that need to be updated:
- For non-deferred mode in
execute():- https://github.com/apache/airflow/blob/providers-amazon/8.24.0/airflow/providers/amazon/aws/operators/redshift_data.py#L165
- For deferred mode in
execute_complete()- https://github.com/apache/airflow/blob/providers-amazon/8.24.0/airflow/providers/amazon/aws/operators/redshift_data.py#L186
- NOTE: we will need to have the deferred operator bubble the number of statements in its
yield TriggerEvent(response- https://github.com/apache/airflow/blob/providers-amazon/8.24.0/airflow/providers/amazon/aws/triggers/redshift_data.py#L99-L101
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hello @thesuperzapper,
I have made some changes to the RedshiftDataOperator to address the issue with handling multiple SQL statements. Please review my pull request https://github.com/apache/airflow/pull/40443 and let me know if my approach is correct.
Was this issue resolved? If not, I'd love to have it assigned to myself. Thanks!
@jroachgolf84 As of now, this issue has not been resolved. Please feel free to take it on. Thanks!
@jroachgolf84 As of now, this issue has not been resolved. Please feel free to take it on. Thanks!
Thank you! Do you mind adding me as the assignee?
assigned
@potiuk can you unassign @isatyamks as no progress has been made.
If anyone else wants to work on this, please comment.
@potiuk can you unassign @isatyamks as no progress has been made.
If anyone else wants to work on this, please comment.
I've got the skeleton of a pull request started, apologies for the delay. I'm continuing to work on this.
This issue can be closed, PR: https://github.com/apache/airflow/pull/42900 addressed this issue and was merged.