spark-daria
RDD-free approach to updating the nullability of a DataFrame
Description: Currently, updating the nullability metadata of a DataFrame schema in Spark is most commonly done via:
spark.createDataFrame(df.rdd, newSchema)
This approach is inefficient and low-level: it converts the DataFrame to an RDD of rows and then builds a new DataFrame from it, which sits awkwardly next to the higher-level DataFrame API. While there are some column-specific workarounds (e.g., using casts) that can trigger the query planner to adjust nullability, there is no straightforward way to align the entire schema, nullability in particular, without dropping down to the RDD layer.
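For reference, here is a minimal sketch of the RDD round-trip described above, applied to every top-level column. The object name `NullabilityViaRdd` and the helpers `withNullability` and `setNullability` are hypothetical names chosen for this illustration; only `createDataFrame`, `df.rdd`, and `StructField.copy` are existing Spark or Scala calls. The sketch does not check whether the data actually satisfies the new nullability.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// Hypothetical helper object, for illustration only.
object NullabilityViaRdd {

  // Return a copy of the schema with every top-level field set to `nullable`.
  // Nested struct, array, and map fields are left untouched in this sketch.
  def withNullability(schema: StructType, nullable: Boolean): StructType =
    StructType(schema.fields.map(_.copy(nullable = nullable)))

  // Rebuild the DataFrame from its RDD of rows with the adjusted schema.
  // This is the RDD round-trip the issue would like to avoid.
  def setNullability(df: DataFrame, nullable: Boolean): DataFrame =
    df.sparkSession.createDataFrame(df.rdd, withNullability(df.schema, nullable))
}
```

A call such as `NullabilityViaRdd.setNullability(df, nullable = true)` would then return a DataFrame whose schema reports every top-level column as nullable.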
Proposed Feature:
Introduce a first-class, RDD-free API for updating the nullability of DataFrame columns.
Example concept:
df.to(targetSchema, alignNullability = true)
Benefits:
- Apply schema alignment (including nullability) using the query planner.
- Avoid data re-encoding and RDD-level transformations.
- Preserve the existing data; only the schema metadata changes.
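Until such an API exists, part of this can be approximated with column expressions alone. The sketch below is an assumption-laden illustration, not the proposed feature: `RelaxNullability` and `relaxTo` are hypothetical names, and it only handles the direction from non-nullable to nullable for top-level columns. It relies on the fact that a `when(...)` expression without an `otherwise` branch is reported as nullable. Tightening a nullable column to non-nullable is not attempted here, which is the gap the proposal is meant to close.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StructType

// Hypothetical helper object, for illustration only.
object RelaxNullability {

  // For each top-level column that is currently non-nullable but marked
  // nullable in `target`, wrap it in `when(c.isNotNull, c)`. The values are
  // unchanged, but the resulting column is reported as nullable.
  // Columns missing from `target` are passed through as-is.
  def relaxTo(df: DataFrame, target: StructType): DataFrame = {
    val wantNullable: Map[String, Boolean] =
      target.fields.map(f => f.name -> f.nullable).toMap

    val projected: Seq[Column] = df.schema.fields.toSeq.map { f =>
      // Backticks keep names containing dots from being parsed as nested
      // field access; names that themselves contain backticks are not handled.
      val c = col(s"`${f.name}`")
      if (!f.nullable && wantNullable.getOrElse(f.name, false))
        when(c.isNotNull, c).as(f.name, f.metadata) // keep name and metadata
      else
        c
    }

    df.select(projected: _*)
  }
}
```

Usage would look like `RelaxNullability.relaxTo(df, targetSchema)`. Because this is a plain projection, it stays inside the DataFrame API and never touches `df.rdd`.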