
Problem with column names that have a dot character

Open dmiljkovic opened this issue 5 years ago • 8 comments

The constraint suggestion code below produces:

org.apache.spark.sql.AnalysisException: syntax error in attribute name: Phone No.;

The problem is that the dot character is part of the column name (Phone No.). When I remove the dot from the name, the code runs fine.

import com.amazon.deequ.SparkContextSpec
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import com.amazon.deequ.utils.FixtureSupport
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.{Matchers, WordSpec}

class TestSuggestions extends WordSpec with Matchers
  with SparkContextSpec with FixtureSupport {

  def testData(sparkSession: SparkSession): DataFrame = {
    import sparkSession.implicits._

    Seq(
      ("CA", "123"),
      ("SD", "1233"),
      ("NC", "1236")
    )
      .toDF("state", "Phone No.")
  }

  "column name problem" should {

    "" in withSparkSession { session =>

      val data = testData(session)
      data.show(false)
      val suggestionResult = ConstraintSuggestionRunner()
        .onData(data)
        .addConstraintRules(Rules.DEFAULT)
        .run()
      suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
        suggestions.foreach { suggestion =>
          println(s"Constraint suggestion for '$column':\t${suggestion.description}\n" +
            s"The corresponding scala code is ${suggestion.codeForConstraint}\n")
        }
      }
    }
  }
}

dmiljkovic avatar Aug 09 '20 22:08 dmiljkovic

This is a known bug unfortunately, we do not correctly escape the column names in all places. We had a person working on this, but they unfortunately never finished their PR. As a work around, you could rename the column on the dataframe before running the test.
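The rename workaround could look like the following. This is a minimal sketch, not deequ's recommended pattern; the replacement name `Phone_No` and the helper name are arbitrary choices made here.

```scala
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import org.apache.spark.sql.DataFrame

// Sketch of the workaround: rename the offending column before running
// deequ, so no generated SQL ever sees the dot. Suggestions will then
// refer to the renamed column ("Phone_No"), which you would have to map
// back to the original name yourself.
def suggestWithoutDots(data: DataFrame): Unit = {
  val sanitized = data.withColumnRenamed("Phone No.", "Phone_No")
  val result = ConstraintSuggestionRunner()
    .onData(sanitized)
    .addConstraintRules(Rules.DEFAULT)
    .run()
  result.constraintSuggestions.foreach { case (column, suggestions) =>
    suggestions.foreach(s => println(s"$column: ${s.description}"))
  }
}
```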

sscdotopen avatar Aug 11 '20 06:08 sscdotopen


Hello, is somebody working on this? I want to try to fix it. I saw this project on Upwork and am very interested in it.

TianLangStudio avatar Aug 11 '20 10:08 TianLangStudio

Feel free to submit a PR for that, we would be very happy. Btw, what exactly do you mean by "saw this project on upwork"?

sscdotopen avatar Aug 11 '20 18:08 sscdotopen

Thanks, I'll try to do it. I was just mentioning how I found this project; it doesn't matter.
https://www.upwork.com/jobs/Looking-for-Scala-freelancer_~01a7fcfb68f4bfa618/

TianLangStudio avatar Aug 12 '20 02:08 TianLangStudio

When the ConstraintSuggestionRunner runs, it loads some data that Spark SQL uses to infer the column data types. Spark SQL fails to load the sample data because the column name contains an unexpected character, ".". So we have to define a rule that checks the column name first. If I am wrong, please feel free to let me know. Thanks.
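The failure can be reproduced outside of deequ: Spark's attribute parser treats a dot as a struct-field accessor unless the name is backtick-quoted. A minimal sketch, assuming a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("CA", "123")).toDF("state", "Phone No.")

// Fails with AnalysisException: "Phone No." is parsed as a field
// named "No." inside a (non-existent) struct column "Phone".
// df.select("Phone No.").show()

// Works: backticks make Spark treat the whole string as one attribute name.
df.select("`Phone No.`").show()
```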

TianLangStudio avatar Aug 12 '20 14:08 TianLangStudio

Would it make more sense to change the underlying col calls to use the wrapped version, or to change the Analyzers to take a Column type instead? I'm looking into how much impact the former would have.

eadgbear avatar Aug 13 '20 21:08 eadgbear

We should not change the API, as this might break existing code. I think what is needed is a proper escaping of the column names.
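One possible shape of that escaping, as a standalone helper. This is a sketch and not deequ's actual code; the function name `escapeColumnName` is made up here. It wraps the name in backticks and doubles any embedded backticks, which is how Spark SQL quotes identifiers.

```scala
// Hypothetical helper: quote a column name so Spark SQL parses it as a
// single attribute even if it contains dots or spaces. Embedded backticks
// are doubled, matching Spark's identifier-quoting rules.
def escapeColumnName(name: String): String =
  "`" + name.replace("`", "``") + "`"

println(escapeColumnName("Phone No."))  // prints `Phone No.`
```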

sscdotopen avatar Aug 15 '20 06:08 sscdotopen

I've got a possible approach that won't break any APIs, I'll try to have a PR available soon


eadgbear avatar Aug 15 '20 09:08 eadgbear