spark-redshift
Some columns in resulting dataframe have nullable = False
We've noticed some strange behavior regarding column nullability. When running a query to return a DataFrame, columns are marked nullable only when raw columns are selected directly. If an expression or function of a column is selected, the corresponding column in the resulting DataFrame is marked non-nullable.
Here is a quick example demonstrating this:
>>> def load_from_redshift(query):
...     dw_connection_string = 'jdbc:redshift://MY_JDBC_PARAMS'
...     df = spark.read \
...         .format("com.databricks.spark.redshift") \
...         .option("url", dw_connection_string) \
...         .option("query", query) \
...         .option('aws_iam_role', MY_IAM_ROLE) \
...         .option("tempdir", MY_TEMP_DIR) \
...         .load()
...     return df
>>> qry = """select id, id + 1 as id_plus_one, null as landmine from mytable"""
>>> df = load_from_redshift(qry)
>>> df.schema
StructType(List(
StructField(id,IntegerType,true),
StructField(id_plus_one,IntegerType,false),
StructField(landmine,StringType,false)
))
>>> df.show()
# raises org.apache.spark.SparkException caused by java.lang.NullPointerException,
# because of the null values in the non-nullable landmine field
Is there a way to change this behavior? I'd love it if all columns resulting from a query were always nullable.
Thanks!
PS: we're using Spark 2.3.0 on Databricks.
It seems there is no API to change the nullable property through the SQL API. As a workaround, you can rebuild the DataFrame with a modified schema. See https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe .