oneDAL
Add other stats for low-order moments
We are using oneDAL distributed algorithms to optimize Spark ML. Some metrics are missing. Could you check if you can add the following stats to distributed low-order moments (basic statistics)?
- count
- numNonzeros
- weightSum
- normL1
- normL2
Check for details: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html
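For reference, the requested metrics can be sketched in a few lines of NumPy. This is only an illustration of their definitions (following Spark's MultivariateStatisticalSummary, where norms are per-column and weighted), not the oneDAL API; the function name and signature are hypothetical:

```python
import numpy as np

def summary_stats(X, weights=None):
    """Illustrative definitions of the requested low-order stats.

    X: (n_samples, n_features) array; weights: optional per-row weights,
    defaulting to all ones (unweighted case).
    """
    if weights is None:
        weights = np.ones(X.shape[0])
    return {
        "count": X.shape[0],                                   # number of observations
        "numNonzeros": np.count_nonzero(X, axis=0),            # per-column non-zero count
        "weightSum": weights.sum(),                            # sum of row weights
        "normL1": (weights[:, None] * np.abs(X)).sum(axis=0),  # per-column weighted L1 norm
        "normL2": np.sqrt((weights[:, None] * X**2).sum(axis=0)),  # per-column weighted L2 norm
    }
```

All five are simple reductions over rows, so they fit the same distributed partial-result/merge pattern as the existing low-order moments.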
Clarification details per our discussion with Xiaochang:
- count: [Xiaochang]: Users usually retrieve several metrics rather than a single one, so it is convenient for them to get the observation count from the result along with the other metrics. Otherwise, extra coding effort is needed on the user's side.
- numNonzeros: [Xiaochang]: simply count the number of values that are not 0.0
- weightSum: [Xiaochang]: Spark's dataframe can have a separate weight column giving a weight for each row. We need to investigate the possibility of adding a corresponding API to compute_input and compute_result.
Also, we need to check how much adding all these metrics affects the performance of the default case (when all metrics are calculated).
Thanks @makart19. Regarding the weight column, could we also consider support for weighted points as a general feature across all algorithms, e.g. weighted points for KMeans? See Spark's KMeans: there is an optional weightCol that can be set. https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeans.html
Ok, we will consider weights support for other algorithms
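To make the weighted-points idea concrete: in weighted k-means, row weights enter the centroid update, which becomes a weighted mean of each cluster's points. A minimal NumPy sketch of that single step (the function name is hypothetical, for illustration only):

```python
import numpy as np

def weighted_centroid_update(X, labels, weights, n_clusters):
    """One weighted k-means centroid update: each centroid is the
    weighted mean of the rows assigned to its cluster."""
    centroids = np.zeros((n_clusters, X.shape[1]))
    for k in range(n_clusters):
        mask = labels == k
        w = weights[mask]
        # Weighted mean: sum(w_i * x_i) / sum(w_i) over the cluster's rows
        centroids[k] = (w[:, None] * X[mask]).sum(axis=0) / w.sum()
    return centroids
```

With all weights equal to 1 this reduces to the ordinary k-means update, so an unweighted path can share the same code.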