statistics `exceptions`-based exceptions

Rather than returning Nothing/0/NaN etc. on failure (the first option being far better than the others), it would be great to generalize code that may throw to the MonadThrow class from the exceptions package.

This way, functions that call throwM e (for some e with an Exception instance) would have the signature MonadThrow m => ... -> m ( ... ), where m may become Maybe, or Either SomeException, or even IO, according to the calling context.
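A minimal sketch of the idea, assuming the `exceptions` package is available. The `StatsError` type and the `meanM` function are hypothetical names made up for illustration and are not part of statistics; the point is only that one `MonadThrow`-polymorphic definition can be used at `Maybe`, `Either SomeException`, or `IO`.

```haskell
import Control.Exception (Exception, SomeException, fromException)
import Control.Monad.Catch (MonadThrow, throwM)

-- Hypothetical error type for illustration; not defined by statistics.
newtype StatsError = EmptySample String
  deriving (Show, Eq)

instance Exception StatsError

-- Mean of a list that reports empty input through MonadThrow
-- instead of silently returning NaN.
meanM :: MonadThrow m => [Double] -> m Double
meanM [] = throwM (EmptySample "meanM: empty sample")
meanM xs = pure (sum xs / fromIntegral (length xs))

-- The caller picks the monad; the definition of meanM does not change.
asMaybe :: [Double] -> Maybe Double
asMaybe = meanM

asEither :: [Double] -> Either SomeException Double
asEither = meanM

asIO :: [Double] -> IO Double
asIO = meanM

-- Recover the typed error from the Either instantiation.
errorOf :: Either SomeException a -> Maybe StatsError
errorOf (Left e)  = fromException e
errorOf (Right _) = Nothing
```

At `Maybe` the error message is discarded, at `Either SomeException` it can be recovered with `fromException`, and at `IO` it becomes an ordinary runtime exception.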

Mar 28 '18 15:03 ocramz

Related: #128, #100, #111, #118 ...

Mar 28 '18 15:03 ocramz

That's an excellent suggestion!

Mar 29 '18 08:03 Shimuuar

I've started addressing this here: https://github.com/DataHaskell/statistics/tree/exceptions-not-error

Jul 19 '18 22:07 ocramz

I'm actually halfway through implementing it. Thing is, once you touch S.Sample you need to adjust basically everything.

Jul 20 '18 04:07 Shimuuar

Yes, I noticed, error is used pretty much throughout. We could skip refactoring the input validation parts for now (i.e. zero input size or negative parameters etc.) and focus on the important ones, e.g. the NaN correlations etc. For example, I've replaced Sample.correlation with this:

-- | Correlation coefficient for sample of pairs. Also known as
--   Pearson's correlation. For empty sample it's set to zero.
correlation :: (G.Vector v (Double,Double), G.Vector v Double, MonadThrow m)
           => v (Double,Double)
           -> m Double
correlation xy
  | n == 0    = pure 0
  | nearZero varX = throwM $ NaNE "Variance of X == 0"
  | nearZero varY = throwM $ NaNE "Variance of Y == 0"
  | otherwise = pure corr
  where
    corr = cov / sqrt (varX * varY)
    n       = G.length xy
    (xs,ys) = G.unzip xy
    (muX,varX) = meanVariance xs
    (muY,varY) = meanVariance ys
    cov = mean $ G.zipWith (*)
            (G.map (\x -> x - muX) xs)
            (G.map (\y -> y - muY) ys)
{-# SPECIALIZE correlation :: U.Vector (Double,Double) -> Maybe Double #-}
{-# SPECIALIZE correlation :: V.Vector (Double,Double) -> Maybe Double #-}

Jul 20 '18 08:07 ocramz

@Shimuuar would you like to join forces on this? I don't have an efficient implementation in mind for Matrix.generateSym, though.

Jul 20 '18 11:07 ocramz

Sure, although I won't be able to do anything till Monday.

Jul 20 '18 11:07 Shimuuar

Hi @Shimuuar :) as discussed, if you point me to your working branch for this we can figure out how to collaborate :)

Jul 24 '18 12:07 ocramz

I just pushed branch exception2 (exception was a complete failure). It's mostly complete except for:

- Statistics.Sample: some functions are commented out, and I'm thinking about using type classes from monoid-statistics for things like calculating mean and variance in a single call (saving one evaluation of the mean). Having dedicated functions is not terribly good, since in that case we get a combinatorial explosion.
- Resampling: again, I'm thinking about the jackknife, which is clearly monoidal (although that's obscured by the API).
- Bootstrap: didn't even touch it.
- Regression: depends on resampling.
- KruskalWallis test.
- A few other things I've certainly forgotten about.

monoid-statistics is in a rather poor state currently. I got lost in figuring out the numeric precision and performance of different algorithms for variance.

Jul 24 '18 19:07 Shimuuar

@Shimuuar Re. monoid-statistics: did you know about foldl-statistics? https://hackage.haskell.org/package/foldl-statistics

Jul 25 '18 09:07 ocramz

Yes. The main difference is that monoid-statistics exposes accumulator types and allows merging estimates from several data sets without refolding them.
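To illustrate what merging estimates without refolding can look like, here is a rough sketch of a mean/variance accumulator with a `Monoid` instance, using the standard pairwise-merge update for the sum of squared deviations. The `MeanVar` type and every function name below are assumptions made for this example; this is not the actual monoid-statistics API.

```haskell
import Data.List (foldl')

-- Hypothetical accumulator: count, running mean, and the sum of
-- squared deviations from the mean (often called M2).
data MeanVar = MeanVar !Int !Double !Double
  deriving Show

instance Semigroup MeanVar where
  MeanVar 0 _ _ <> b = b
  a <> MeanVar 0 _ _ = a
  MeanVar na ma sa <> MeanVar nb mb sb = MeanVar n m s
    where
      n     = na + nb
      fa    = fromIntegral na
      fb    = fromIntegral nb
      fn    = fromIntegral n
      delta = mb - ma
      -- Combined mean and M2 of the two partial samples.
      m     = ma + delta * fb / fn
      s     = sa + sb + delta * delta * fa * fb / fn

instance Monoid MeanVar where
  mempty = MeanVar 0 0 0

-- Accumulator for a single observation.
singleton :: Double -> MeanVar
singleton x = MeanVar 1 x 0

-- One pass over the data yields both mean and variance.
fromSample :: [Double] -> MeanVar
fromSample = foldl' (\acc x -> acc <> singleton x) mempty

-- Mean of the accumulated sample, if it is non-empty.
accMean :: MeanVar -> Maybe Double
accMean (MeanVar 0 _ _) = Nothing
accMean (MeanVar _ m _) = Just m

-- Population (biased) variance of the accumulated sample.
accVariance :: MeanVar -> Maybe Double
accVariance (MeanVar 0 _ _) = Nothing
accVariance (MeanVar n _ s) = Just (s / fromIntegral n)
```

With this shape, `fromSample xs <> fromSample ys` should agree with `fromSample (xs ++ ys)` up to floating-point rounding, so two data sets can be summarized separately and combined later.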

Jul 25 '18 09:07 Shimuuar

Aha! That's a clever thing to have. However, what do you think of setting up speed benchmarks before looking into adding streaming capabilities? I would like to start adding basic summary functionality to criterion-measurement soon, to make it self-contained.

Jul 25 '18 09:07 ocramz

Why, of course! Without benchmarks, all performance statements are just hopes and prayers.

Jul 25 '18 09:07 Shimuuar