
Some columns in resulting dataframe have nullable = False

Open andrew-candela opened this issue 7 years ago • 1 comments

We've noticed some strange behavior regarding column nullability. When running a query to return a DataFrame, the columns are marked nullable only if raw columns are selected. If an expression (a function of a column, or a literal) is selected, the corresponding column in the resulting DataFrame is marked not nullable.

Here is a quick example demonstrating this:

>>> def load_from_redshift(query):
...     dw_connection_string = 'jdbc:redshift://MY_JDBC_PARAMS'
...     df = spark.read \
...         .format("com.databricks.spark.redshift") \
...         .option("url", dw_connection_string) \
...         .option("query", query) \
...         .option('aws_iam_role', MY_IAM_ROLE) \
...         .option("tempdir", MY_TEMP_DIR) \
...         .load()
...     return df


>>> qry = """select id, id + 1 as id_plus_one, null as landmine from mytable"""
>>> df = load_from_redshift(qry)

>>> df.schema
  StructType(List(
    StructField(id,IntegerType,true),
    StructField(id_plus_one,IntegerType,false),
    StructField(landmine,StringType,false)
  ))

# gives org.apache.spark.SparkException > java.lang.NullPointerException 
# because of the null values in the landmine field
>>> df.show()

Is there a way to change this behavior? I'd love it if all columns resulting from a query were always nullable.

Thanks!

PS: we're using Spark 2.3.0 on Databricks.

andrew-candela avatar Aug 29 '18 19:08 andrew-candela

It seems there is no API to change the nullable property through the SQL/DataFrame API.

We can work around it by rebuilding the DataFrame with a modified schema in which the fields are marked nullable. Refer to https://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe .

tobegit3hub avatar Jun 09 '20 13:06 tobegit3hub