spark-perf
MLlib TODO items
- [ ] Change Scala testName to match Python test names: "glm-regression" -> GLMRegressionTest
- [ ] Make parameter names match across all tests. (num-examples, num-rows, etc.)
- [ ] Refactor correlation tests so pearson/spearman is a parameter.
- [ ] Better data generation in Python
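The first item above is mechanical enough to script. A minimal sketch of the hyphenated-to-CamelCase conversion, assuming the naming convention shown in the example; the `scala_test_name` helper and the `acronyms` set are illustrations, not part of spark-perf:

```python
def scala_test_name(python_name: str) -> str:
    """Convert a hyphenated Python test name (e.g. "glm-regression")
    to the Scala-style class name (e.g. "GLMRegressionTest")."""
    # Words kept fully upper-cased in the Scala name (assumed list).
    acronyms = {"glm", "als", "pca", "svd"}
    parts = [w.upper() if w in acronyms else w.capitalize()
             for w in python_name.split("-")]
    return "".join(parts) + "Test"
```

A mapping like this would also let the config file keep one canonical name per test and derive the other.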
It would be great to measure the performance loss of PySpark vs. Scala for MLlib models implemented in Scala.
This can be done by running both sets of tests; they use the same set of parameters in the config file. I've done it some, and the performance gap varies by algorithm. For long training times, it does not matter. For prediction, it varies some. The Spark 1.2 release brought Python a lot closer to Scala for prediction.
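Since both suites accept the same parameters, the comparison boils down to timing each side on identical configs. A rough sketch of such a harness, assuming each test can be wrapped in a Python callable that runs it end to end (the helper names here are hypothetical, not spark-perf API):

```python
import time

def time_call(fn, repeats=5):
    """Return the best wall-clock time over several runs of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def overhead_ratio(python_fn, scala_fn, repeats=5):
    """Ratio of Python time to Scala time (> 1.0 means Python is slower)."""
    return time_call(python_fn, repeats) / time_call(scala_fn, repeats)
```

Taking the best of several runs helps damp noise from JVM warm-up and GC pauses, which otherwise dominate short prediction benchmarks.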
Any plans to test the performance of the new ml API (Pipeline, CrossValidation, GridSearch, etc.)?
I don't think we will for this release, but we will need to for the next one. We've been focusing on the API for now, but I hope the API can be stabilized before long.