spark-daria

RDD-free approach to updating the nullability of a DataFrame

Open zeotuan opened this issue 6 months ago • 0 comments

Description: Currently, updating the nullability metadata of a DataFrame schema in Spark is most commonly done via:

spark.createDataFrame(df.rdd, newSchema)

This round trip through the RDD layer is inefficient and low-level, especially when the rest of the job uses the higher-level DataFrame API. There are some column-specific workarounds (e.g., using casts) that can trigger the query planner to adjust nullability, but there is no straightforward way to align the entire schema, nullability in particular, without dropping down to the RDD layer.
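For reference, here is a minimal sketch of the two existing options side by side. The object and method names (`NullabilityToday`, `viaRdd`, `relaxColumn`, `tightenColumn`) are made up for illustration. The column-level variants use `when` and `coalesce` rather than casts, because those are the expressions whose effect on the analyzed schema is easy to show. Note that `coalesce` replaces nulls with the default, so it is only data-preserving when the column holds no nulls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
import org.apache.spark.sql.types.StructType

object NullabilityToday {
  // Current approach: rebuild the DataFrame from its RDD with an edited schema.
  def viaRdd(spark: SparkSession, df: DataFrame, nullable: Boolean): DataFrame = {
    val newSchema = StructType(df.schema.map(_.copy(nullable = nullable)))
    spark.createDataFrame(df.rdd, newSchema)
  }

  // Column-level workaround: a CASE WHEN without ELSE is nullable in the analyzed plan.
  def relaxColumn(df: DataFrame, name: String): DataFrame =
    df.withColumn(name, when(col(name).isNotNull, col(name)))

  // Column-level workaround: coalesce with a non-null literal is non-nullable.
  // Caveat: any existing nulls are replaced by `default`.
  def tightenColumn(df: DataFrame, name: String, default: String): DataFrame =
    df.withColumn(name, coalesce(col(name), lit(default)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("nullability").getOrCreate()
    import spark.implicits._

    // From a tuple encoder: `id` (Int) is non-nullable, `name` (String) is nullable.
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

    viaRdd(spark, df, nullable = true).printSchema()
    relaxColumn(df, "id").printSchema()
    tightenColumn(df, "name", "").printSchema()

    spark.stop()
  }
}
```

The column-level versions stay inside the query planner, but they have to be applied one column at a time and do not cover nested fields.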

Proposed Feature:

Introduce a first-class, RDD-free API for updating the nullability of DataFrame columns.

Example concept:

df.to(targetSchema, alignNullability = true)

Benefits:

  • Apply schema alignment (including nullability) using the query planner.
  • Avoid data re-encoding and RDD-level transformations.
  • Preserve existing data.
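As a rough illustration of what planner-based alignment could look like, the sketch below relaxes nullability for top-level columns using only column expressions. It is not the proposed API: `NullabilityAlign` and `relaxNullability` are hypothetical names, it only handles the non-nullable to nullable direction, it ignores nested fields, and it assumes plain column names without dots or backticks.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StructType

object NullabilityAlign {
  // Hypothetical helper: mark a top-level column nullable wherever
  // targetSchema marks the column of the same name as nullable.
  def relaxNullability(df: DataFrame, targetSchema: StructType): DataFrame = {
    val targetNullable = targetSchema.fields.map(f => f.name -> f.nullable).toMap

    val projected = df.schema.fields.map { field =>
      val c = col(field.name)
      if (!field.nullable && targetNullable.getOrElse(field.name, false)) {
        // CASE WHEN without ELSE is nullable; values are unchanged.
        // Re-attach the original column metadata, which the alias would otherwise drop.
        when(c.isNotNull, c).as(field.name, field.metadata)
      } else {
        c
      }
    }

    // A single projection: no df.rdd call and no createDataFrame.
    df.select(projected: _*)
  }
}
```

Usage would be along the lines of `NullabilityAlign.relaxNullability(df, targetSchema)`. Tightening a column to non-nullable without changing data has no equally simple public expression, which is part of why a first-class API would help.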

zeotuan, May 08 '25 23:05