adding dirichlet distribution
I am adding Dirichlet distribution in include/boost/math/distributions/. The Dirichlet distribution is a multivariate distribution function, so vector type inputs are needed, but the tests in test/compile_test/test_compile_result.hpp are for int, double, float, long double, unsigned. How can I add tests for Dirichlet distribution?
TODO:
- [ ] Add tests.
- [ ] Add google benchmark.
- [x] Decision on `VectorType` and its implementation.
- [x] Cross-check the `cdf` function. (I could not find it, so tried to derive it.)
- [x] Implementation of `skewness` function.
- [x] Implementation of `kurtosis` function.
Also, none of the distribution functions generate random samples of the corresponding distribution. That means no sampling function implemented. In PyTorch, we can do something like,
```python
>>> import torch
>>> import torch.distributions as dist
>>> d = dist.dirichlet.Dirichlet(torch.tensor([0.3, 0.3, 0.3]))
>>> d.sample()
tensor([0.7736, 0.1908, 0.0356])
>>> d.rsample(torch.tensor([5]))
tensor([[2.3072e-02, 9.4317e-01, 3.3756e-02],
        [1.8339e-01, 1.6803e-03, 8.1493e-01],
        [7.3136e-01, 1.9131e-01, 7.7323e-02],
        [2.1086e-01, 9.5684e-03, 7.7958e-01],
        [9.6672e-06, 4.6919e-02, 9.5307e-01]])
```
Can we add such a feature, or is it already implemented somewhere?
> Also, none of the distribution functions generate random samples of the corresponding distribution. That means no sampling function implemented. In PyTorch, we can do something like,
Ok, got it. It is implemented in boostorg/random repository.
Looks like a really promising start.
It would be really nice if the `Real` defaults could be `vector`.
For testing, \boost\libs\math\test\test_beta_dist.cpp is a slightly useful template. However, it is probably over-complicated, at least to get started on testing. I found it easier to start with testing just double values and then refactor later with other types and higher precision. This will at least give some sanity checks.
(Although the usefulness of high precision for this distribution is unclear, our policy is to aim for tests at at least 128-bit precision (~36 decimal digits).)
There is also the important question of test values. Python could be used to generate some test values for a start. https://cran.r-project.org/web/packages/DirichletReg/DirichletReg.pdf might also be slightly useful; it is ominously marked as 'orphaned', but is quite new. Our favorite Wolfram doesn't seem to be able to help.
The beta distribution tests should be applicable in the cases where the two distributions are identical, so this might be the first thing to try?
We also like to have some example(s). Being able to show how to combine with Boost.Random to replicate the Wikipedia/Python example(s) would be good.
There are lots of tests that check for 'dodgy' input like negative, NaN, or inf values. These check throwing (if the policy is the default). (And don't forget any examples should use throw'n'catch blocks so that the error messages are displayed.) You can probably leave me to tidy these up later.
Looks like a great start. I'll go through your other items on your checklist later.
Hi @pabristow @NAThompson Can we use something like validate_args to see if the user wants to check for argument validity? Currently, we are validating the arguments when an instance of the object is created, which is inefficient.
I have a vague recollection that we had a complaint similar to this many years ago. In the end we decided to keep the apparently redundant check because there were circumstances when it would allow a 'bad' value to sneak in. Checking seems quite cheap, as is construction generally?
Right. If some 'invalid' value comes in for reasons unknown to the user, it would be a disaster.
@mrityunjay-tripathi Could you write a google benchmark and show how fast it is with and without the argument validation? This is a really hard question to answer without hard data. An example is here.
@jzmaddock , @pabristow : For scalar distribution, using free functions to compute kurtosis, range, so on, is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these functions go from free functions acting on the class to member functions?
@mrityunjay-tripathi : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp".
> @mrityunjay-tripathi : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp".
This is clear now. I will soon add the tests.
> @mrityunjay-tripathi Could you write a google benchmark and show how fast it is with and without the argument validation? This is a really hard question to answer without hard data. An example is here.
I am not familiar with 'google benchmark', but ready to delve into it. As soon as I add the tests, I will try to write the benchmark as well.
> @jzmaddock , @pabristow : For scalar distribution, using free functions to compute kurtosis, range, so on, is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these functions go from free functions acting on the class to member functions?
@jzmaddock devised this scheme - with good reasons - so he is best qualified to comment on this change to multi-thingys.
From: Mrityunjay Tripathi [email protected] Sent: 29 February 2020 16:23 To: boostorg/math [email protected] Cc: Paul A. Bristow [email protected]; Mention [email protected] Subject: Re: [boostorg/math] adding dirichlet distribution (#318)
> @mrityunjay-tripathi : We have a file "boost/libs/math/test/math_unit_test.hpp", which you can use to write unit tests. An example is "test/whittaker_shannon_test.cpp".
> This is clear now. I will soon add the tests.
For distributions, a distribution test is a much better starting point to see how Boost.Test is used.
For example: \boost\libs\math\test\test_normal.cpp
I am looking for some 'known-good' values from SciPy, for example, but hopefully you have some already.
> For scalar distribution, using free functions to compute kurtosis, range, so on, is just fine. But for vector types, following this pattern is getting a bit awkward. Should each of these functions go from free functions acting on the class to member functions?
For better or worse I'd prefer to keep to the current pattern if possible. Implementation and use wise, I think it makes very little practical difference.
I confess to a certain amount of hesitation in adding something that's not entirely mainstream as our first multivariate distribution. I had always assumed that the multivariate normal would come first, but no matter, I guess.
With regard to the CDF, I note that multivariate normal has 2 different possible definitions, and I assume that is the case here also. I think we should be careful about treading on shaky ground here. Likewise, I assume there is no real notion of a quantile, not least because there is unlikely to be a unique vector of x-values corresponding to any given probability.
> I am not familiar with 'google benchmark', but ready to delve into it. As soon as I add the tests, I will try to write the benchmark as well.
You will definitely enjoy adding google benchmark to your toolkit. It's amazing.
> I confess to a certain amount of hesitation in adding something that's not entirely mainstream as our first multivariate distribution. I had always assumed that the multivariate normal would come first, but no matter, I guess.
I think the problem is that the multivariate normal factors into products, so there is little demand for such a thing. Writing down a useful definition of the cdf that translates into an ergonomic API is going to be a problem, though.
> Likewise, I assume there is no real notion of a quantile, not least because there is unlikely to be a unique vector of x-values corresponding to any given probability.
@jzmaddock I thought of using some random values so that they satisfy the condition. But it seems there is no practical use for this. I was putting it for the sense of completeness of the implementation.
> Writing down a useful definition of the cdf that translates into an ergonomic API is going to be a problem though.
@NAThompson: I've used this deduction to calculate the CDF.

Talking about efficiency, this distribution obviously is computationally expensive.
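For reference (this is the standard textbook definition, not a reconstruction of the derivation above), the Dirichlet density is

```latex
f(x_1,\dots,x_K;\,\alpha_1,\dots,\alpha_K)
  = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1},
\qquad
B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\bigl(\sum_{i=1}^{K} \alpha_i\bigr)},
```

supported on the simplex \(x_i > 0\), \(\sum_i x_i = 1\). The \(K\) gamma-function evaluations in \(B(\alpha)\) are a large part of why each pdf call is comparatively expensive.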
```
./include/boost/math/distributions/dirichlet.hpp:53:20: error: need 'typename' before 'RandomAccessContainer::value_type' because 'RandomAccessContainer' is a dependent scope
   53 |   using RealType = RandomAccessContainer::value_type;
      |                    ^~~~~~~~~~~~~~~~~~~~~
      |                    typename
```
Also:
```
./include/boost/math/distributions/dirichlet.hpp:231:12: error: 'm_alpha' was not declared in this scope; did you mean 'alpha'?
  231 |   decltype(m_alpha.size()) &order() const { return m_alpha.size(); }
      |            ^~~~~~~
      |            alpha
```
I was trying this sample program from scipy, but the values I am getting are not in sync with those of scipy. Not even the 'mean' is correct! What might be the reason?
```cpp
#include <bits/stdc++.h>
#include <boost/math/distributions/dirichlet.hpp>
using namespace std;

void print(std::vector<long double> v)
{
    for (size_t i = 0; i < v.size(); ++i)
    {
        cout << v[i] << " ";
    }
    cout << endl;
}

int main()
{
    std::vector<long double> alpha = {0.4, 5.0, 15.0};
    std::vector<long double> x = {0.2, 0.2, 0.6};
    using VectorType = typename std::vector<long double>;
    boost::math::dirichlet_distribution<VectorType> diri(std::move(alpha));
    long double p = pdf(diri, x);
    long double c = cdf(diri, x);
    std::vector<long double> m = mean(diri);
    std::vector<long double> v = variance(diri);
    std::vector<long double> std = standard_deviation(diri);
    long double en = entropy(diri);
    cout << "pdf: " << p << endl;
    cout << "cdf: " << c << endl;
    cout << "entropy: " << en << endl;
    cout << "mean: ";
    print(m);
    cout << "variance: ";
    print(v);
    cout << "statndard deviation: ";
    print(std);
}
```
The output:
```
pdf: 0.0203131
cdf: 6.90545e-05
entropy: -nan
mean: 0.02 0.25 0.75
variance: 0.000933333 0.00892857 0.00892857
statndard deviation: 0.0305505 0.0944911 0.0944911
```
Get your unit tests in, make them as simple and as meaningful as possible, and these errors will become obvious to you. Though I think you might be using "moved-from" data; some of your functions take data by move that should be taken by reference.
This is really coming together nicely; good work.
Ohh. Got it!
```cpp
RealType sum_alpha() const { return accumulate(m_alpha.begin(), m_alpha.end(), 0); }
```
The initializing value must be RealType. Here I made the mistake of keeping it int.
Something has gone wrong ;-) (It often does...)
Special cases have always proved good, especially if the result should be 'exact'.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.dirichlet.html looks recent (last updated on Dec 19, 2019), so it is likely to be correct?
After a little light googling, I also found a more recent bit of code in R and some tests that might be helpful?
https://github.com/bertcarnell/dirichlet/blob/7954c574e2bb1359883e8d4b810d266acc6c5a35/tests/testthat/test-dirichlet.r
```r
it("A two parameter Dirichlet is equivalent to a beta", {
  expect_equal(ddirichlet(c(0.3, 0.7), c(2, 3)), dbeta(0.3, 2, 3))
})
```
http://ugrad.stat.ubc.ca/R/library/MCMCpack/html/dirichlet.html might be useful too? Old but might be correct nonetheless?
HTH
> After a little light googling, I also found a more recent bit of code in R and some tests that might be helpful? https://github.com/bertcarnell/dirichlet/blob/7954c574e2bb1359883e8d4b810d266acc6c5a35/tests/testthat/test-dirichlet.r
@pabristow: This seems helpful. Thank you.
BTW, your code is a little verbose:
```cpp
std::vector<long double> alpha = {0.4, 5.0, 15.0};
std::vector<long double> x = {0.2, 0.2, 0.6};
using VectorType = typename std::vector<long double>;
boost::math::dirichlet_distribution<VectorType> diri(std::move(alpha));
```
Instead:
```cpp
std::vector<double> alpha = {0.4, 5.0, 15.0};
std::vector<double> x = {0.2, 0.2, 0.6};
auto diri = boost::math::dirichlet_distribution(std::move(alpha));
```
Note how long double doesn't help since your floating point literals are parsed as doubles, and that the template argument can be deduced by the compiler.
> Note how long double doesn't help since your floating point literals are parsed as doubles, and that the template argument can be deduced by the compiler.
How about using 'L' as a suffix for those floating-point literals?
> How about using 'L' as a suffix for those floating-point literals?
Yes, that would fix the issue.
```cpp
#include <bits/stdc++.h>
#include <boost/math/distributions/dirichlet.hpp>
using namespace std;

template <typename T>
void print(std::vector<T> v)
{
    for (size_t i = 0; i < v.size(); ++i)
    {
        cout << fixed << setprecision(20) << v[i] << ", ";
    }
    cout << endl;
}

int main()
{
    using RealType = long double;
    using VectorType = typename std::vector<RealType>;
    VectorType alpha = {3.462347646231L, 5.2136450991334L, 15.341L};
    VectorType x = {0.03536252L, 0.28787621346L, 0.676761265L};
    boost::math::dirichlet_distribution<VectorType> diri(std::move(alpha));
    RealType p = pdf(diri, x);
    RealType c = cdf(diri, x);
    VectorType m = mean(diri);
    VectorType v = variance(diri);
    VectorType std = standard_deviation(diri);
    RealType en = entropy(diri);
    cout << "alpha: ";
    print(diri.get_alpha());
    cout << "x: ";
    print(x);
    cout << "pdf: " << p << endl;
    cout << "cdf: " << c << endl;
    cout << "entropy: " << en << endl;
    cout << "mean: ";
    print(m);
    cout << "variance: ";
    print(v);
    cout << "standard deviation: ";
    print(std);
    cout << "Normalizing Constant, B: " << diri.normalizing_constant() << endl;
    cout << "Sum of Concentration Parameters, a0: " << diri.sum_alpha() << endl;
    cout << "Sum of quantiles: " << std::accumulate(x.begin(), x.end(), 0.0) << endl;
}
```
The output:
```
alpha: 3.46234764623100000009, 5.21364509913340000002, 15.34100000000000000016,
x: 0.03536252000000000000, 0.28787621346000000000, 0.67676126499999999999,
pdf: 6.12940449892651902141
cdf: 0.00015248845250954929
entropy: 93.39123000056522383439
mean: 0.14416241379342879160, 0.21708151201151955451, 0.63875607419505165389,
variance: 0.00493183227490614376, 0.00679366744373401038, 0.00922360070303550949,
standard deviation: 0.07022700531067905191, 0.08242370680656148689, 0.09603957883620434480,
Normalizing Constant, B: 0.00000000084752204199
Sum of Concentration Parameters, a0: 24.01699274536439999983
Sum of quantiles: 0.99999999845999998360
```
I also cross-checked it with scipy. It's giving correct results now. I am almost done with the tests. I will make a PR for it, probably tomorrow.
What is the header #include <bits/stdc++.h> doing?
> What is the header #include <bits/stdc++.h> doing?
Nothing xD. I just have the habit of putting it whenever writing sketch code (maybe a bad practice). In place of it, we can use:
```cpp
#include <iostream>
#include <vector>
#include <iomanip>
```