spark-daria

CustomTransform RequiredColumns & AddedColumns are case sensitive

Open: labbedaine opened this issue 2 years ago • 1 comment

Hi.

I would like to know if there is a way to turn off case sensitivity on requiredColumns and addedColumns? Even if I have spark.sql.caseSensitive set to false my unit test is still failing.

sparkSession.conf.set("spark.sql.caseSensitive", false)

test("CustomTransform RequiredColumns & AddedColumns are case sensitive") {
  val lowercaseDF = spark.createDF(List(("Hello, world")), List(("lowercase", StringType, false)))

  lowercaseDF
    .trans(
      CustomTransform(
        requiredColumns = Seq("LOWERCASE"),
        transform = withTest(),
        addedColumns = Seq("test"),
      )
    )

  def withTest()(df: DataFrame): DataFrame = {
    df.withColumn("test", lit("A simple test."))
  }
}

The [LOWERCASE] columns are not included in the DataFrame with the following columns [lowercase]
com.github.mrpowers.spark.daria.sql.MissingDataFrameColumnsException: The [LOWERCASE] columns are not included in the DataFrame with the following columns [lowercase]
    at com.github.mrpowers.spark.daria.sql.DataFrameColumnsChecker.validatePresenceOfColumns(DataFrameColumnsChecker.scala:19)

Thank you!

labbedaine, Nov 18 '22 17:11

Hi,

This seems to be a problem with how the library validates the columns. I can fix it by applying the following change, if @MrPowers agrees.

In class com.github.mrpowers.spark.daria.sql.DataFrameColumnsChecker, I would change val missingColumns = requiredColNames.diff(df.columns.toSeq) to:

    val givenColumns = df.columns.toSeq.map(_.toLowerCase)
    val requiredColumnsLower = requiredColNames.map(_.toLowerCase)
    val missingColumns = requiredColumnsLower.diff(givenColumns)

That way the check keeps its O(n) time complexity and the comparison no longer depends on the case of the column names.
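As a rough sketch of the idea above, the comparison could also be made conditional on a flag, so that case-sensitive setups keep the current behavior. The object name ColumnCheckSketch, the method missingColumns, and the caseSensitive parameter are hypothetical and are not part of spark-daria. The sketch works on plain column-name sequences so it runs without Spark, and it returns the names as the caller spelled them so the error message stays readable.

```scala
import java.util.Locale

// Hypothetical helper, not part of spark-daria: illustrates a
// case-insensitive variant of the missing-columns check.
object ColumnCheckSketch {

  // Returns the required column names that are absent from `actual`.
  // `caseSensitive` is an assumed parameter; the caller decides its value.
  def missingColumns(
      required: Seq[String],
      actual: Seq[String],
      caseSensitive: Boolean
  ): Seq[String] = {
    if (caseSensitive) {
      // Same comparison the issue describes as the current behavior.
      required.diff(actual)
    } else {
      // Locale.ROOT avoids locale-specific lowercasing surprises.
      val actualLower = actual.map(_.toLowerCase(Locale.ROOT)).toSet
      // Keep the original spelling of each missing name in the result.
      required.filterNot(name => actualLower.contains(name.toLowerCase(Locale.ROOT)))
    }
  }
}
```

With a DataFrame, required would be the requiredColNames and actual would be df.columns.toSeq. The flag could presumably be derived from the spark.sql.caseSensitive setting of the session, but that wiring is an assumption and is not shown here.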

brayanjuls, Dec 05 '22 19:12