incubator-hivemall
[HIVEMALL-182][SPARK][WIP] Add an optimizer rule to filter out columns with low variances
What changes were proposed in this pull request?
This PR adds a new optimizer rule, VarianceThreshold, to Spark. When enabled, the rule drops columns whose variance falls below a configurable threshold:
scala> spark.read.option("inferSchema", "true").csv("test.csv").write.saveAsTable("t")
scala> sql("SELECT * FROM t").show
+---+--------+---+----+
|_c0|     _c1|_c2| _c3|
+---+--------+---+----+
|  1|   "one"|1.0| 1.0|
|  1|   "two"|1.1| 2.3|
|  1| "three"|0.9| 3.5|
|  1|   "one"|0.9|10.3|
+---+--------+---+----+
// Enable the optimizer rule and print again
scala> sql("SET spark.sql.cbo.enabled=true")
scala> sql("SET spark.sql.statistics.histogram.enabled=true")
scala> sql("SET spark.sql.optimizer.featureSelection.enabled=true")
scala> sql("SET spark.sql.optimizer.featureSelection.varianceThreshold=0.10")
scala> sql("SELECT * FROM t").show
+--------+----+
|     _c1| _c3|
+--------+----+
|   "one"| 1.0|
|   "two"| 2.3|
| "three"| 3.5|
|   "one"|10.3|
+--------+----+
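The column selection in the example above can be checked by hand. The following is a minimal sketch in plain Scala, not the code in this patch: the object and method names are hypothetical, it assumes population variance and a "keep if variance >= threshold" comparison, and it scans the values directly, whereas the rule presumably relies on column statistics. Non-numeric columns such as _c1 are left out because the example keeps them.

```scala
// Hypothetical illustration only; none of these names come from the patch.
object VarianceThresholdSketch {
  // Population variance of a numeric column (the estimator is an assumption).
  def variance(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum / xs.size
  }

  // Names of the numeric columns whose variance reaches the threshold.
  def selectColumns(columns: Map[String, Seq[Double]], threshold: Double): Set[String] =
    columns.collect { case (name, values) if variance(values) >= threshold => name }.toSet

  // Numeric columns of the example table `t`.
  val numericColumns: Map[String, Seq[Double]] = Map(
    "_c0" -> Seq(1.0, 1.0, 1.0, 1.0),
    "_c2" -> Seq(1.0, 1.1, 0.9, 0.9),
    "_c3" -> Seq(1.0, 2.3, 3.5, 10.3)
  )

  // Same threshold as in the example session.
  val kept: Set[String] = selectColumns(numericColumns, threshold = 0.10)
}
```

Under these assumptions _c0 has zero variance and _c2 has a variance of roughly 0.007, so both fall below 0.10 and only _c3 survives among the numeric columns, which matches the output shown above.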
TODO
- Add docs in gitbook
- Add more tests
- Brush up
VarianceThreshold code
What type of PR is it?
Feature
What is the Jira issue?
https://issues.apache.org/jira/browse/HIVEMALL-182
How was this patch tested?
Added tests in FeatureSelectionRuleSuite.
@maropu CI failing.
I'll fix later.
@maropu is this PR still WIP?
Sorry for the slow progress. I'm checking feasibility in a separate repo, because there are some issues to solve first: https://github.com/maropu/spark-catalyst-rule-rewiter/tree/master Please give me more time, and thanks.