spark-daria
CustomTransform RequiredColumns & AddedColumns are case sensitive
Hi.
Is there a way to turn off case sensitivity for requiredColumns and addedColumns? Even with spark.sql.caseSensitive set to false, my unit test still fails:
```scala
spark.conf.set("spark.sql.caseSensitive", false)

test("CustomTransform RequiredColumns & AddedColumns are case sensitive") {
  def withTest()(df: DataFrame): DataFrame = {
    df.withColumn("test", lit("A simple test."))
  }

  val lowercaseDF = spark.createDF(
    List("Hello, world"),
    List(("lowercase", StringType, false))
  )

  lowercaseDF
    .trans(
      CustomTransform(
        requiredColumns = Seq("LOWERCASE"),
        transform = withTest(),
        addedColumns = Seq("test")
      )
    )
}
```
The test fails with:

```
com.github.mrpowers.spark.daria.sql.MissingDataFrameColumnsException: The [LOWERCASE] columns are not included in the DataFrame with the following columns [lowercase]
  at com.github.mrpowers.spark.daria.sql.DataFrameColumnsChecker.validatePresenceOfColumns(DataFrameColumnsChecker.scala:19)
```
Thank you!
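In the meantime, one possible user-side workaround is to resolve each required name against the DataFrame's actual columns before building the CustomTransform. The helper below is hypothetical (not part of spark-daria), shown as plain Scala with no Spark dependency:

```scala
// Hypothetical workaround helper (not part of spark-daria): look up a
// required column name among the DataFrame's actual column names, ignoring
// case, and return the correctly-cased name to pass to CustomTransform.
object ColumnResolver {
  def resolve(actualColumns: Seq[String], wanted: String): Option[String] =
    actualColumns.find(_.equalsIgnoreCase(wanted))
}
```

With the DataFrame above, `ColumnResolver.resolve(df.columns.toSeq, "LOWERCASE")` would yield `Some("lowercase")`, which can then be used in `requiredColumns` so the built-in case-sensitive check passes.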
Hi,
This looks like a problem with how the library validates columns. I can go ahead and fix it by applying the following change, if @MrPowers agrees. In com.github.mrpowers.spark.daria.sql.DataFrameColumnsChecker, I would change `val missingColumns = requiredColNames.diff(df.columns.toSeq)` to:
```scala
val givenColumns = df.columns.toSeq.map(_.toLowerCase)
val requiredColumnsLower = requiredColNames.map(_.toLowerCase)
val missingColumns = requiredColumnsLower.diff(givenColumns)
```
That keeps the check at O(n) time complexity and solves the problem.
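The proposed change can be sketched in isolation as plain Scala, with no Spark dependency (the object name here is invented for the sketch; in the library the logic would live inside DataFrameColumnsChecker):

```scala
// Standalone sketch of the case-insensitive check proposed above:
// lowercase both sides before diffing, so "LOWERCASE" matches "lowercase".
object CaseInsensitiveColumnsChecker {
  def missingColumns(requiredColNames: Seq[String], dfColumns: Seq[String]): Seq[String] = {
    val givenColumns = dfColumns.map(_.toLowerCase)
    val requiredColumnsLower = requiredColNames.map(_.toLowerCase)
    requiredColumnsLower.diff(givenColumns)
  }
}
```

With the failing test's inputs, `missingColumns(Seq("LOWERCASE"), Seq("lowercase"))` would now come back empty, so validatePresenceOfColumns would no longer throw.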