
Add other stats for low-order moments

Open xwu99 opened this issue 2 years ago • 3 comments

We are using oneDAL distributed algorithms to optimize Spark ML. Some metrics are missing; could you check whether the following statistics can be added to distributed low-order moments (basic statistics)?

  • count
  • numNonzeros
  • weightSum
  • normL1
  • normL2

See the Spark documentation for details: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html
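To make the request concrete, here is a minimal NumPy sketch of how the requested per-column statistics are defined, following the semantics of Spark's `MultivariateStatisticalSummary` (the function name `summary_stats` and the use of NumPy are illustrative assumptions, not part of the oneDAL API):

```python
import numpy as np

def summary_stats(X, w=None):
    """Illustrative per-column statistics in the style of Spark's
    MultivariateStatisticalSummary. X: (n_rows, n_cols) data; w: optional
    per-row weight column (defaults to all-ones, i.e. unweighted)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    return {
        "count": n,                                  # number of observations
        "numNonzeros": np.count_nonzero(X, axis=0),  # non-zero entries per column
        "weightSum": w.sum(),                        # total weight over all rows
        "normL1": (w[:, None] * np.abs(X)).sum(axis=0),        # weighted L1 norm
        "normL2": np.sqrt((w[:, None] * X**2).sum(axis=0)),    # weighted L2 norm
    }

stats = summary_stats([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
print(stats["count"])        # 3
print(stats["numNonzeros"])  # [2 2]
```

With the default all-ones weights this reduces to the unweighted case, which is why a single weighted code path (as discussed below for the weight column) can cover both.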

xwu99 avatar Nov 29 '21 02:11 xwu99

Clarification details per our discussion with Xiaochang:

  • count: [Xiaochang]: Users usually retrieve several metrics at once, so it is convenient to get the observation count from the result along with the other metrics. Otherwise, the user needs extra coding effort.
  • numNonzeros: [Xiaochang]: simply counts the number of entries that are not 0.0.
  • weightSum: [Xiaochang]: Spark's DataFrame has a separate per-row weight column. We need to investigate the possibility of adding a corresponding API to compute_input and compute_result.

Also, we need to check how much adding all these metrics affects the performance of the default case (when all metrics are calculated).

makart19 avatar Dec 13 '21 10:12 makart19

Thanks @makart19. Regarding the weight column, could you also consider general support for weighted points as a feature across all algorithms, such as weighted points for k-means? See Spark's KMeans, which has an optional weightCol parameter: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.KMeans.html

xwu99 avatar Dec 13 '21 13:12 xwu99

OK, we will consider weight support for other algorithms.

makart19 avatar Dec 13 '21 14:12 makart19