Problem with column names that contain a dot character
The constraint suggestion code below produces:
org.apache.spark.sql.AnalysisException: syntax error in attribute name: Phone No.;
The problem is that the dot character is part of the column name (Phone No.). When I remove the dot from the name, the code runs fine.
import com.amazon.deequ.SparkContextSpec
import com.amazon.deequ.suggestions.{ConstraintSuggestionRunner, Rules}
import com.amazon.deequ.utils.FixtureSupport
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.{Matchers, WordSpec}

class TestSuggestions extends WordSpec with Matchers
  with SparkContextSpec with FixtureSupport {

  def testData(sparkSession: SparkSession): DataFrame = {
    import sparkSession.implicits._

    Seq(
      ("CA", "123"),
      ("SD", "1233"),
      ("NC", "1236")
    ).toDF("state", "Phone No.")
  }

  "column name problem" should {
    "" in withSparkSession { session =>
      val data = testData(session)
      data.show(false)

      val suggestionResult = ConstraintSuggestionRunner()
        .onData(data)
        .addConstraintRules(Rules.DEFAULT)
        .run()

      suggestionResult.constraintSuggestions.foreach { case (column, suggestions) =>
        suggestions.foreach { suggestion =>
          println(s"Constraint suggestion for '$column':\t${suggestion.description}\n" +
            s"The corresponding scala code is ${suggestion.codeForConstraint}\n")
        }
      }
    }
  }
}
Unfortunately, this is a known bug: we do not correctly escape the column names in all places. We had a person working on this, but they never finished their PR. As a workaround, you could rename the column on the dataframe before running the test.
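For illustration, that workaround could look like the following minimal sketch, reusing the data frame from the test above (the dot-free name "Phone No" is just an arbitrary choice):

// Rename the problematic column to a dot-free name before running the suggestion runner.
val renamed = data.withColumnRenamed("Phone No.", "Phone No")

val suggestionResult = ConstraintSuggestionRunner()
  .onData(renamed)
  .addConstraintRules(Rules.DEFAULT)
  .run()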
Hello, is somebody working on this? I want to try to fix it. I saw this project on Upwork and am very interested in it.
Feel free to submit a PR for that; we would be very happy. Btw, what exactly do you mean by "saw this project on Upwork"?
Thanks, I'll try to do it. I was just saying how I found this project; it doesn't matter.
https://www.upwork.com/jobs/Looking-for-Scala-freelancer_~01a7fcfb68f4bfa618/
When the ConstraintSuggestionRunner is running, it loads some data that Spark SQL uses to infer the column data types. Loading that sample data fails because the column name contains an unexpected character, ".". So we may have to define a rule that checks the column names first. If I am wrong, please feel free to let me know. Thanks.
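To illustrate the underlying Spark behavior described above (a self-contained sketch, independent of deequ):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

val df = Seq(("CA", "123")).toDF("state", "Phone No.")

// Spark parses the unescaped dot as struct-field access, so this throws
// org.apache.spark.sql.AnalysisException: syntax error in attribute name: Phone No.;
// df.select("Phone No.")

// Quoting the name in backticks makes Spark treat it as a single attribute:
df.select("`Phone No.`").show()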
Would it make more sense to change the underlying col calls to use the wrapped version, or to change the Analyzers to take a Column type instead? I'm looking into how much impact the former would have.
We should not change the API, as this might break existing code. I think what is needed is a proper escaping of the column names.
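As a sketch of what such escaping could look like (escapeColumn is a hypothetical name, not deequ's actual API; it applies the backtick quoting that Spark SQL expects for attribute names):

// Hypothetical helper: quote a raw column name with backticks so Spark SQL
// treats it as a single attribute; backticks inside the name are doubled.
def escapeColumn(name: String): String =
  s"`${name.replace("`", "``")}`"

escapeColumn("Phone No.")  // yields `Phone No.`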
I've got a possible approach that won't break any APIs; I'll try to have a PR available soon.