airflow icon indicating copy to clipboard operation
airflow copied to clipboard

Changes to SqlToS3Operator Breaking CSV formats

Open harryplumer opened this issue 3 years ago • 1 comments

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==5.1.0

Apache Airflow version

2.3.4

Operating System

Linux

Deployment

Astronomer

Deployment details

No response

What happened

Once https://github.com/apache/airflow/pull/25083 was merged, when using CSV as the output format on the SqlToS3Operator, null strings started appearing as "None" in the actual CSV export. This will cause unintended behavior in most use cases for reading the CSV including uploading to databases.

Certain databases such as Snowflake allow for things like NULL_IF on import however there are times where you would want the actual string "None" to be in the field and there would be no way at that point to distinguish.

Before:

Screen Shot 2022-09-21 at 11 36 00 AM

After:

Screen Shot 2022-09-21 at 11 35 52 AM

What you think should happen instead

The strings should be empty as they did previously. I understand the implementation of the recent PR for parquet and propose that we add an additional condition to line 138 of the sql_to_s3.py file restricting that to only if the chosen output is parquet.

How to reproduce

Run the SqlToS3Operator with the default output format of CSV on any query that selects a column of type string that allows null. Look the outputted CSV in S3.

Anything else

Every time we select a nullable column with the SqlToS3Operator

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

harryplumer avatar Sep 21 '22 17:09 harryplumer

Feel free to open a PR!

josh-fell avatar Sep 21 '22 18:09 josh-fell