incubator-hivemall

[HIVEMALL-182][SPARK][WIP] Add an optimizer rule to filter out columns with low variances

Open maropu opened this issue 8 years ago • 4 comments

What changes were proposed in this pull request?

This PR adds a new optimizer rule, VarianceThreshold, to Spark. When enabled, it filters out columns whose variance falls below a configured threshold:

scala> spark.read.option("inferSchema", "true").csv("test.csv").write.saveAsTable("t")
scala> sql("SELECT * FROM t").show
+---+--------+---+----+
|_c0|     _c1|_c2| _c3|
+---+--------+---+----+
|  1|   "one"|1.0| 1.0|
|  1|   "two"|1.1| 2.3|
|  1| "three"|0.9| 3.5|
|  1|   "one"|0.9|10.3|
+---+--------+---+----+

// Enable the optimizer rule and print the table again
scala> sql("SET spark.sql.cbo.enabled=true")
scala> sql("SET spark.sql.statistics.histogram.enabled=true")
scala> sql("SET spark.sql.optimizer.featureSelection.enabled=true")
scala> sql("SET spark.sql.optimizer.featureSelection.varianceThreshold=0.10")
scala> sql("SELECT * FROM t").show
+--------+----+
|     _c1| _c3|
+--------+----+
|   "one"| 1.0|
|   "two"| 2.3|
| "three"| 3.5|
|   "one"|10.3|
+--------+----+
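
To see why _c0 and _c2 disappear from the output above, the sketch below recomputes the variance of each numeric column of t in plain Scala and compares it with the 0.10 threshold. It illustrates the filtering criterion only and is not the rule's implementation: the helper name variance, the use of population variance, and the strict comparison are assumptions. The string column _c1 is left out because it has no variance and is kept in the example output.

```scala
// Population variance of a numeric column.
// Assumption: the rule may use sample variance instead; for this data
// the same columns pass the threshold either way.
def variance(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.size
  xs.map(x => math.pow(x - mean, 2)).sum / xs.size
}

// Numeric columns of the example table t, copied from the output above
val columns = Seq(
  "_c0" -> Seq(1.0, 1.0, 1.0, 1.0),
  "_c2" -> Seq(1.0, 1.1, 0.9, 0.9),
  "_c3" -> Seq(1.0, 2.3, 3.5, 10.3)
)

// Same value as spark.sql.optimizer.featureSelection.varianceThreshold above
val threshold = 0.10

// Keep only the columns whose variance exceeds the threshold
// (assumption: strict comparison)
val kept = columns.collect {
  case (name, xs) if variance(xs) > threshold => name
}

println(kept)
```

Here _c0 is constant (variance 0) and _c2 varies only between 0.9 and 1.1, so both fall below 0.10, while _c3 ranges from 1.0 to 10.3 and is kept.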

TODO

  • Add docs in gitbook
  • Add more tests
  • Brush up VarianceThreshold code

What type of PR is it?

Feature

What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-182

How was this patch tested?

Added tests in FeatureSelectionRuleSuite.

maropu, Mar 29 '18 22:03

@maropu CI is failing.

myui, Apr 01 '18 09:04

I'll fix it later.

maropu, Apr 01 '18 21:04

@maropu is this PR still WIP?

myui, Aug 13 '18 20:08

Sorry for the slow progress. I'm checking feasibility in a separate repo, because there are still some issues to solve: https://github.com/maropu/spark-catalyst-rule-rewiter/tree/master Please give me some more time. Thanks.

maropu, Aug 15 '18 00:08