adding dirichlet distribution
I am adding Dirichlet distribution in include/boost/math/distributions/. The Dirichlet distribution is a multivariate distribution function, so vector type inputs are needed, but the tests in test/compile_test/test_compile_result.hpp are for int, double, float, long double, unsigned. How can I add tests for Dirichlet distribution?
TODO:
- [ ] Add tests.
- [ ] Add google benchmark.
- [x] Decision on `VectorType` and its implementation.
- [x] Cross-check the `cdf` function. (I could not find it, so tried to derive it.)
- [x] Implementation of `skewness` function.
- [x] Implementation of `kurtosis` function.
Also, none of the distribution functions generate random samples of the corresponding distribution. That means no sampling function implemented. In PyTorch, we can do something like,
```python
>>> import torch
>>> import torch.distributions as dist
>>> d = dist.dirichlet.Dirichlet(torch.tensor([0.3, 0.3, 0.3]))
>>> d.sample()
tensor([0.7736, 0.1908, 0.0356])
>>> d.rsample(torch.tensor([5]))
tensor([[2.3072e-02, 9.4317e-01, 3.3756e-02],
        [1.8339e-01, 1.6803e-03, 8.1493e-01],
        [7.3136e-01, 1.9131e-01, 7.7323e-02],
        [2.1086e-01, 9.5684e-03, 7.7958e-01],
        [9.6672e-06, 4.6919e-02, 9.5307e-01]])
```
Can we add such a feature, or is it already implemented somewhere?
> Also, none of the distribution functions generate random samples of the corresponding distribution. That means no sampling function implemented. In PyTorch, we can do something like,
Ok, got it. It is implemented in boostorg/random repository.
Looks like a really promising start.
It would be really nice if the `Real` defaults could be `vector`.
For testing, \boost\libs\math\test\test_beta_dist.cpp is a slightly useful template. However, it is probably over-complicated, at least to get started on testing. I found it easier to start with testing just double values and then refactor later with other types and higher precision. This will at least give some sanity checks.
(Although the usefulness of high precision for this distribution is unclear, our policy is to aim for tests at at least 128-bit precision (~36 decimal digits).)
There is also the important question of test values. Python could be used to generate some test values for a start. https://cran.r-project.org/web/packages/DirichletReg/DirichletReg.pdf might also be slightly useful; it is ominously marked as 'orphaned', but is quite new. Our favorite Wolfram doesn't seem to be able to help.
The beta distribution tests should be applicable in the cases where the two distributions are identical, so this might be the first thing to try?
We also like to have some example(s). Being able to show how to combine with Boost.Random to replicate the Wikipedia/Python example(s) would be good.
There are lots of tests that check for 'dodgy' input like negative, NaN, or inf values. These check throwing (if the policy is the default). (And don't forget any examples should use throw'n'catch blocks so that the error messages are displayed.) You can probably leave me to tidy these up later.
Looks like a great start. I'll go through your other items on your checklist later.
Hi @pabristow @NAThompson Can we use something like validate_args to see if the user wants to check for argument validity? Currently, we are validating the arguments when an instance of the object is created, which is inefficient.
I have a vague recollection that we had a complaint similar to this many years ago. In the end we decided to keep the apparently redundant check because there were circumstances when it would allow a 'bad' value to sneak in. Checking seems quite cheap, as is construction generally?
Right. If some 'invalid' value comes in for reasons unknown to the user, it would be a disaster.
@mrityunjay-tripathi Could you write a google benchmark and show how fast it is with and without the argument validation? This is a really hard question to answer without hard data. An example is here.
@jzmaddock , @pabristow : For scalar distribution, using free functions to compute kurtosis, range, so on, is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these functions go from free functions acting on the class to member functions?
@mrityunjay-tripathi : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp".
> @mrityunjay-tripathi : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp".
This is clear now. I will soon add the tests.
> @mrityunjay-tripathi Could you write a google benchmark and show how fast it is with and without the argument validation? This is a really hard question to answer without hard data. An example is here.
I am not familiar with 'google benchmark', but ready to delve into it. As soon as I add the tests, I will try to write the benchmark as well.
> @jzmaddock , @pabristow : For scalar distribution, using free functions to compute kurtosis, range, so on, is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these functions go from free functions acting on the class to member functions?
@jzmaddock devised this scheme - with good reasons - so he is best qualified to comment on this change to multi-thingys.
From: Mrityunjay Tripathi [email protected] Sent: 29 February 2020 16:23 To: boostorg/math [email protected] Cc: Paul A. Bristow [email protected]; Mention [email protected] Subject: Re: [boostorg/math] adding dirichlet distribution (#318)
> @mrityunjay-tripathi : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp".
> This is clear now. I will soon add the tests.
For distributions, a distribution test is a much better starting point to see how Boost.Test is used.
For example: \boost\libs\math\test\test_normal.cpp
I am looking for some 'known-good' values from SciPy, for example, but hopefully you have some already.
> For scalar distribution, using free functions to compute kurtosis, range, so on, is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these functions go from free functions acting on the class to member functions?
For better or worse I'd prefer to keep to the current pattern if possible. Implementation and use wise, I think it makes very little practical difference.
I confess to a certain amount of hesitation in adding something that's not entirely mainstream as our first multivariate distribution. I had always assumed that the multivariate normal would come first, but no matter, I guess.
With regard to the CDF, I note that multivariate normal has 2 different possible definitions, and I assume that is the case here also. I think we should be careful about treading on shaky ground here. Likewise, I assume there is no real notion of a quantile, not least because there is unlikely to be a unique vector of x-values corresponding to any given probability.
> I am not familiar with 'google benchmark', but ready to delve into it. As soon as I add the tests, I will try to write the benchmark as well.
You will definitely enjoy adding google benchmark to your toolkit. It's amazing.
> I confess to a certain amount of hesitation in adding something that's not entirely mainstream as our first multivariate distribution. I had always assumed that the multivariate normal would come first, but no matter, I guess.
I think the problem is that the multivariate normal factors into products, so there is little demand for such a thing. Writing down a useful definition of the cdf that translates into an ergonomic API is going to be a problem, though.
> Likewise, I assume there is no real notion of a quantile, not least because there is unlikely to be a unique vector of x-values corresponding to any given probability.
@jzmaddock I thought of using some random values so that they satisfy the condition. But it seems there is no practical use for this. I was putting it for the sense of completeness of the implementation.
> Writing down a useful definition of the cdf that translates into an ergonomic API is going to be a problem though.
@NAThompson: I've used this deduction to calculate the CDF.

Talking about efficiency, this distribution obviously is computationally expensive.
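For reference (this is the standard textbook definition, not a reconstruction of the derivation above), the Dirichlet density is

```latex
f(x_1,\dots,x_K;\,\alpha_1,\dots,\alpha_K)
  = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1},
\qquad
B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\bigl(\sum_{i=1}^{K} \alpha_i\bigr)},
```

supported on the simplex \(x_i > 0\), \(\sum_i x_i = 1\). The \(K\) gamma-function evaluations in \(B(\alpha)\) are a large part of why each pdf call is comparatively expensive.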
```
./include/boost/math/distributions/dirichlet.hpp:53:20: error: need 'typename' before 'RandomAccessContainer::value_type' because 'RandomAccessContainer' is a dependent scope
   53 |   using RealType = RandomAccessContainer::value_type;
      |                    ^~~~~~~~~~~~~~~~~~~~~
      |                    typename
```
Also:
```
./include/boost/math/distributions/dirichlet.hpp:231:12: error: 'm_alpha' was not declared in this scope; did you mean 'alpha'?
  231 |   decltype(m_alpha.size()) &order() const { return m_alpha.size(); }
      |            ^~~~~~~
      |            alpha
```
I was trying this sample program from scipy, but the values I am getting are not in sync with those of scipy. Not even the 'mean' is correct! What might be the reason?
```cpp
#include <bits/stdc++.h>
#include <boost/math/distributions/dirichlet.hpp>
using namespace std;

void print(std::vector<long double> v)
{
    for (size_t i = 0; i < v.size(); ++i)
    {
        cout << v[i] << " ";
    }
    cout << endl;
}

int main()
{
    std::vector<long double> alpha = {0.4, 5.0, 15.0};
    std::vector<long double> x = {0.2, 0.2, 0.6};
    using VectorType = typename std::vector<long double>;
    boost::math::dirichlet_distribution<VectorType> diri(std::move(alpha));
    long double p = pdf(diri, x);
    long double c = cdf(diri, x);
    std::vector<long double> m = mean(diri);
    std::vector<long double> v = variance(diri);
    std::vector<long double> std = standard_deviation(diri);
    long double en = entropy(diri);
    cout << "pdf: " << p << endl;
    cout << "cdf: " << c << endl;
    cout << "entropy: " << en << endl;
    cout << "mean: ";
    print(m);
    cout << "variance: ";
    print(v);
    cout << "statndard deviation: ";
    print(std);
}
```
The output:
```
pdf: 0.0203131
cdf: 6.90545e-05
entropy: -nan
mean: 0.02 0.25 0.75
variance: 0.000933333 0.00892857 0.00892857
statndard deviation: 0.0305505 0.0944911 0.0944911
```
Get your unit tests in, make them as simple and as meaningful as possible, and these errors will become obvious to you. Though I think you might be using "moved-from" data; some of your functions take data by move that should be taken by reference.
This is really coming together nicely; good work.
Ohh. Got it!
```cpp
RealType sum_alpha() const { return accumulate(m_alpha.begin(), m_alpha.end(), 0); }
```
The initializing value must be RealType. Here I made the mistake of keeping it int.
Something has gone wrong ;-) (It often does...)
Special cases have always proved good, especially if the result should be 'exact'.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.dirichlet.html looks recent (last updated on Dec 19, 2019), so it is likely to be correct?
After a little light googling, I also found a more recent bit of code in R and some tests that might be helpful?
https://github.com/bertcarnell/dirichlet/blob/7954c574e2bb1359883e8d4b810d266acc6c5a35/tests/testthat/test-dirichlet.r
```r
it("A two parameter Dirichlet is equivalent to a beta", {
  expect_equal(ddirichlet(c(0.3, 0.7), c(2, 3)), dbeta(0.3, 2, 3))
})
```
http://ugrad.stat.ubc.ca/R/library/MCMCpack/html/dirichlet.html might be useful too? Old but might be correct nonetheless?
HTH
> After a little light googling, I also found a more recent bit of code in R and some tests that might be helpful? https://github.com/bertcarnell/dirichlet/blob/7954c574e2bb1359883e8d4b810d266acc6c5a35/tests/testthat/test-dirichlet.r
@pabristow: This seems helpful. Thank you.
BTW, your code is a little verbose:
```cpp
std::vector<long double> alpha = {0.4, 5.0, 15.0};
std::vector<long double> x = {0.2, 0.2, 0.6};
using VectorType = typename std::vector<long double>;
boost::math::dirichlet_distribution<VectorType> diri(std::move(alpha));
```
Instead:
```cpp
std::vector<double> alpha = {0.4, 5.0, 15.0};
std::vector<double> x = {0.2, 0.2, 0.6};
auto diri = boost::math::dirichlet_distribution(std::move(alpha));
```
Note how long double doesn't help since your floating point literals are parsed as doubles, and that the template argument can be deduced by the compiler.
> Note how long double doesn't help since your floating point literals are parsed as doubles, and that the template argument can be deduced by the compiler.
How about using 'L' as a suffix for those floating-point literals?
> How about using 'L' as a suffix for those floating-point literals?
Yes, that would fix the issue.
```cpp
#include <bits/stdc++.h>
#include <boost/math/distributions/dirichlet.hpp>
using namespace std;

template <typename T>
void print(std::vector<T> v)
{
    for (size_t i = 0; i < v.size(); ++i)
    {
        cout << fixed << setprecision(20) << v[i] << ", ";
    }
    cout << endl;
}

int main()
{
    using RealType = long double;
    using VectorType = typename std::vector<RealType>;
    VectorType alpha = {3.462347646231L, 5.2136450991334L, 15.341L};
    VectorType x = {0.03536252L, 0.28787621346L, 0.676761265L};
    boost::math::dirichlet_distribution<VectorType> diri(std::move(alpha));
    RealType p = pdf(diri, x);
    RealType c = cdf(diri, x);
    VectorType m = mean(diri);
    VectorType v = variance(diri);
    VectorType std = standard_deviation(diri);
    RealType en = entropy(diri);
    cout << "alpha: ";
    print(diri.get_alpha());
    cout << "x: ";
    print(x);
    cout << "pdf: " << p << endl;
    cout << "cdf: " << c << endl;
    cout << "entropy: " << en << endl;
    cout << "mean: ";
    print(m);
    cout << "variance: ";
    print(v);
    cout << "standard deviation: ";
    print(std);
    cout << "Normalizing Constant, B: " << diri.normalizing_constant() << endl;
    cout << "Sum of Concentration Parameters, a0: " << diri.sum_alpha() << endl;
    cout << "Sum of quantiles: " << std::accumulate(x.begin(), x.end(), 0.0) << endl;
}
```
The output:
```
alpha: 3.46234764623100000009, 5.21364509913340000002, 15.34100000000000000016,
x: 0.03536252000000000000, 0.28787621346000000000, 0.67676126499999999999,
pdf: 6.12940449892651902141
cdf: 0.00015248845250954929
entropy: 93.39123000056522383439
mean: 0.14416241379342879160, 0.21708151201151955451, 0.63875607419505165389,
variance: 0.00493183227490614376, 0.00679366744373401038, 0.00922360070303550949,
standard deviation: 0.07022700531067905191, 0.08242370680656148689, 0.09603957883620434480,
Normalizing Constant, B: 0.00000000084752204199
Sum of Concentration Parameters, a0: 24.01699274536439999983
Sum of quantiles: 0.99999999845999998360
```
I also cross-checked it with scipy. It's giving correct results now. I am almost done with the tests. I will make a PR for it, probably tomorrow.
What is the header #include <bits/stdc++.h> doing?
> What is the header #include <bits/stdc++.h> doing?
Nothing xD. I just have the habit of putting it whenever writing sketch code (maybe a bad practice). In place of it, we can use:
```cpp
#include <iostream>
#include <vector>
#include <iomanip>
```