Multivariate Kernel Density Estimation
Hello,
I am trying to use Chaospy to perform advanced sampling of a multivariate KDE generated via sm.nonparametric.KDEMultivariate. Unfortunately, I am not able to define the KDE as a custom distribution in such a way that I can then call the sample operation. Could someone help me understand how I might accomplish this, if it can be done at all?
Any help would be deeply appreciated!
import numpy as np
import statsmodels.api as sm
import chaospy as cp
data = np.array([[ 0.64206175, 0.49193947, -1.47253426],
[-0.94203536, -1.16952506, 0.99483489],
[-0.53587827, -0.29164457, 0.6460487 ],
[ 0.43096139, -0.59982553, -0.01857683],
[ 0.52384554, -0.54059551, 0.11118488],
[-0.65789427, 0.15139668, 0.09203062],
[ 0.2663031 , -0.4434035 , -0.16632009],
[ 0.08475679, -0.0661517 , -0.34676888],
[-0.79848711, -1.63556875, 0.52249545],
[-1.18427302, -1.33508375, 1.12194074],
[ 0.18819597, -0.84600592, -0.03214548]])
kde = sm.nonparametric.KDEMultivariate(data=data, var_type='ccc', bw='cv_ml')
cp_kde = cp.construct(cdf=lambda self, q: kde.cdf(q),
bnd=lambda self: [(-1.82392181, 9.81120074),
(-8.63429993, 3.11334478),
( -4.74632693, 2.08811487)])()
cp_kde.sample(4, 'L')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-9984df9b1b05> in <module>()
22 ( -4.74632693, 2.08811487)])()
23
---> 24 cp_kde.sample(4, 'L')
/home/<>/lib/python2.7/site-packages/chaospy/distributions/baseclass.pyc in sample(self, size, rule, antithetic, verbose, **kws)
287 from . import sampler
288 out = sampler.generator.generate_samples(
--> 289 order=size_, domain=self, rule=rule, antithetic=antithetic)
290 try:
291 out = out.reshape(shape)
/home/<>/lib/python2.7/site-packages/chaospy/distributions/sampler/generator.pyc in generate_samples(order, domain, rule, antithetic)
137 assert rule in SAMPLERS, "rule not recognised"
138 sampler = SAMPLERS[rule]
--> 139 x_data = trans(sampler(order=order, dim=dim))
140
141 logger.debug("order: %d, dim: %d -> shape: %s", order, dim, x_data.shape)
/home/<>/lib/python2.7/site-packages/chaospy/distributions/baseclass.pyc in inv(self, q, maxiter, tol, verbose, **kws)
207 """
208 from . import rosenblatt
--> 209 return rosenblatt.inv(self, q, maxiter, tol, **kws)
210
211 def pdf(self, x, step=1e-7, verbose=0):
/home/<>/lib/python2.7/site-packages/chaospy/distributions/rosenblatt.pyc in inv(dist, q, maxiter, tol, **kws)
36
37 try:
---> 38 out, graph = dist.graph.run(q, "inv", maxiter=maxiter, tol=tol)
39
40 except NotImplementedError:
/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/baseclass.pyc in run(self, x, mode, **meta)
196 def run(self, x, mode, **meta):
197 """Run through network to perform an operator."""
--> 198 return main.call(self, x, mode, **meta)
199
200 def counting(self, dist, mode):
/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/main.pyc in call(self, x_data, mode, **meta)
59 out = self(x_data)
60 else:
---> 61 out = self(x_data, self.root)
62
63 # if mode == "ttr":
/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/baseclass.pyc in __call__(self, *args, **kwargs)
158
159 def __call__(self, *args, **kwargs):
--> 160 return self._call(self, *args, **kwargs)
161
162 def __str__(self):
/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/calling/inv.pyc in inv_call(self, q_data, dist)
33 else:
34 out, _, _ = approx.ppf(
---> 35 dist, q_data, self, retall=1, **self.meta)
36 graph.add_node(dist, key=out)
37
/home/<>/lib/python2.7/site-packages/chaospy/distributions/approx.pyc in ppf(dist, q, G, maxiter, tol, retall, verbose)
249
250 if not dist.advance:
--> 251 dist.prm, prm = G.K.build(), dist.prm
252 out = inv(dist, q, maxiter, tol, retall, verbose)
253 dist.prm = prm
AttributeError: Graph instance has no attribute 'K'
Hello.
What you are asking for should be doable, but is likely not as straightforward as one would like. However, I do see this as an excellent use case for Chaospy, so if we can figure out your use case we are likely to be able to generalize from it.
Theory
First of all, the name cdf is a bit misleading. It represents the cumulative distribution function only in the univariate case. In the multivariate case it is the forward Rosenblatt transformation. If you are not familiar with it, in 3 variables the Rosenblatt transformation R is defined as follows:
p(x, y, z) = p(x) p(y | x) p(z | x, y)
R(x, y, z) = [F(x), F(y | x), F(z | x, y)]
Here p denotes probability density functions and F cumulative distribution functions.
For our 3D KDE we then have to construct three KDEs: one univariate, and two conditional ones. I'm not very familiar with statsmodels, but it looks like it supports conditional KDE.
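To make the transformation concrete, here is a small standalone sketch (not chaospy or statsmodels, just scipy) of the forward Rosenblatt transformation for a standard bivariate Gaussian with correlation rho, where the conditional CDF is known in closed form:
import numpy as np
from scipy.stats import norm

rho = 0.6

def rosenblatt_2d(x, y):
    # Forward Rosenblatt transform R(x, y) = [F(x), F(y | x)] for a
    # standard bivariate Gaussian with correlation rho.
    u1 = norm.cdf(x)
    u2 = norm.cdf((y - rho * x) / np.sqrt(1.0 - rho**2))
    return np.array([u1, u2])

print(rosenblatt_2d(0.3, -0.5))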
Framework
I am currently working on a larger refactor of the distribution module to simplify and improve lots of the bits. Your use case, for example, raises a very uninformative error which could be a lot more descriptive. The core of the new system is in place, but lots of testing and sanitation is still unfinished, so expect bugs.
You will have to check out the branch dist_clean to be able to run the code below.
Implementation
Here is a proposal:
class KDE(cp.Dist):

    def __init__(self, data):
        cp.Dist.__init__(self)
        # One unconditional KDE for the first variable, and one conditional
        # KDE for each remaining variable given the ones before it.
        self.kdes = [
            sm.nonparametric.KDEMultivariate(
                data=data[:, :1], var_type="c", bw="cv_ml")
        ] + [
            sm.nonparametric.KDEMultivariateConditional(
                endog=[data[:, idx]], exog=list(data[:, :idx].T),
                dep_type='c', indep_type='c'*idx, bw='cv_ml')
            for idx in range(1, data.shape[-1])
        ]
        self.data = data

    def _cdf(self, q):
        # Forward Rosenblatt transformation: [F(q0), F(q1 | q0), F(q2 | q0, q1)].
        out = [self.kdes[0].cdf(q[0])]
        out += [
            self.kdes[idx].cdf(endog_predict=q[idx], exog_predict=q[:idx].T)
            for idx in range(1, len(self))
        ]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out

    def _bnd(self, x):
        # Lower and upper bounds taken from the data itself.
        return np.min(self.data, 0), np.max(self.data, 0)

    def __len__(self):
        return self.data.shape[-1]


cp_kde = KDE(data)
print(cp_kde.sample(7))
As I noted, I am not too experienced with KDE, so please verify that what I have written makes sense to you.
Also, since there is no inverse transformation _ppf, it is instead approximated numerically, so this is by no means a fast method.
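If you want a quick sanity check that the samples are reasonable (this is only a suggestion on top of the proposal above, not something built into chaospy), you could compare sample moments against the data; note that chaospy returns samples with shape (dim, size):
samples = cp_kde.sample(100)
print(samples.mean(axis=1))  # per-variable mean of the samples
print(data.mean(axis=0))     # per-variable mean of the original data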
@flo2k, since you made the 1D KDE distribution, your thoughts on a multivariate variant are welcome.
(Also, congrats on the baby.)
@jonathf I'll try to have a look at the multivariate KDE on Thursday. Maybe we can expand the 1D solution.
Thank you both for your help and interest! I'm afraid I need to work on something else this afternoon, but I will try to work with the suggested solution on Thursday as well.
I can't tell you how much I appreciate the help!
Hello,
I am very sorry to bother you about a silly question, but I can't seem to get the code you sent to run:
import chaospy as cp
import statsmodels.api as sm
import numpy as np
class KDE(cp.Dist):

    def __init__(self, data):
        cp.Dist.__init__(self)
        self.kdes = ([
            sm.nonparametric.KDEMultivariate(
                data=data[:, :1], var_type="c", bw="cv_ml")
        ] + [
            sm.nonparametric.KDEMultivariateConditional(
                endog=[data[:, idx]], exog=list(data[:, :idx].T),
                dep_type='c', indep_type='c'*idx, bw='cv_ml')
            for idx in range(1, data.shape[-1])
        ])
        self.data = data

    def _cdf(self, q):
        out = [self.kdes[0].cdf(q[0])]
        out += [
            self.kdes[idx].cdf(endog_predict=q[idx], exog_predict=q[:idx].T)
            for idx in range(1, len(self))
        ]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out

    def _bnd(self, x):
        return np.min(self.data, 0), np.max(self.data, 0)

    def __len__(self):
        return self.data.shape[-1]
D = np.array([[ 0.64206175, 0.49193947, -1.47253426],
[-0.94203536, -1.16952506, 0.99483489],
[-0.53587827, -0.29164457, 0.6460487 ],
[ 0.43096139, -0.59982553, -0.01857683],
[ 0.52384554, -0.54059551, 0.11118488],
[-0.65789427, 0.15139668, 0.09203062],
[ 0.2663031 , -0.4434035 , -0.16632009],
[ 0.08475679, -0.0661517 , -0.34676888],
[-0.79848711, -1.63556875, 0.52249545],
[-1.18427302, -1.33508375, 1.12194074],
[ 0.18819597, -0.84600592, -0.03214548]])
cp_kde = KDE(D)
print(cp_kde.sample(7))
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-55-32fcf2f33ea3> in <module>()
11 [ 0.18819597, -0.84600592, -0.03214548]])
12
---> 13 cp_kde = KDE(D)
14 print(cp_kde.sample(7))
<ipython-input-54-9090c24341fc> in __init__(self, data)
5 class KDE(cp.Dist):
6 def __init__(self,data):
----> 7 cp.Dist.__init__(self)
8 self.kdes = ([sm.nonparametric.KDEMultivariate( data=data[:, :1], var_type="c", bw="cv_ml")]+
9 [sm.nonparametric.KDEMultivariateConditional(
/home/<>/lib/python2.7/site-packages/chaospy/distributions/baseclass.pyc in __init__(self, **prm)
109 self.prm = prm.copy()
110 self.graph = graph.Graph(self)
--> 111 self.dependencies = self.graph.run(self.length, "dep")[0]
112
113 def range(self, x=None, retall=False, verbose=False):
/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/baseclass.pyc in run(self, x, mode, **meta)
196 def run(self, x, mode, **meta):
197 """Run through network to perform an operator."""
--> 198 return main.call(self, x, mode, **meta)
199
200 def counting(self, dist, mode):
/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/main.pyc in call(self, x_data, mode, **meta)
54
55 if mode in ("rnd", "dep"):
---> 56 out = self(self.root)
57
58 elif mode == "val":
/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/baseclass.pyc in __call__(self, *args, **kwargs)
158
159 def __call__(self, *args, **kwargs):
--> 160 return self._call(self, *args, **kwargs)
161
162 def __str__(self):
/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/calling/dep.pyc in dep_call(self, dist)
10
11 if hasattr(dist, "_dep"):
---> 12 out = dist._dep(self)
13 else:
14 for val in self.dist.prm.values():
/home<>/lib/python2.7/site-packages/chaospy/distributions/baseclass.pyc in _dep(self, graph)
378 """
379 sets = [graph(dist) for dist in graph.dists]
--> 380 if len(self)==1:
381 out = [set([self])]
382 else:
<ipython-input-54-9090c24341fc> in __len__(self)
25
26 def __len__(self):
---> 27 return self.data.shape[-1]
AttributeError: 'KDE' object has no attribute 'data'
Hello @AnnaCraig, I'm sorry, but I'm too busy today with other things. I hope I can have a look at your things by the end of the week.
Best,
Florian
Hello Florian,
I am just hugely grateful that you would be taking the time to help at all: thank you again!
Best regards, Anna
The error you are getting shows that you are still using the old backend. I should perhaps have been more explicit with my instructions, so let us try again.
You need to install a version from the branch clean_dist. On Mac/Linux, open a terminal and do the following:
git clone https://github.com/jonathf/chaospy
cd chaospy
git checkout clean_dist
# create/enable virtual environment, if that is your thing
python setup.py install
(The last command might need to be run as super user if you are installing system-wide.)
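Alternatively, if your pip supports installing directly from git (untested on my end), this one-liner should be equivalent:
pip install git+https://github.com/jonathf/chaospy.git@clean_dist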
I knew it was a stupid error on my part! It looks like it is running now, so I will work on validation and let you know how it goes.
Thank you again so very much for your help!
Best regards, Anna
Hello,
Just to follow up on this, the code seems to be running, but the outputs are not really as good as I would expect. Below are some figures/runtime stats, in case you are interested.
Thank you again for all of your help!
Best regards, Anna
class KDE(cp.Dist):

    def __init__(self, data, bw):
        cp.Dist.__init__(self)
        self.kdes = ([
            sm.nonparametric.KDEMultivariate(
                data=data[:, :1], var_type="c", bw=bw[0])
        ] + [
            sm.nonparametric.KDEMultivariateConditional(
                endog=[data[:, idx]], exog=list(data[:, :idx].T),
                dep_type='c', indep_type='c'*idx, bw=bw[idx])
            for idx in range(1, data.shape[-1])
        ])
        self.data = data

    def _cdf(self, q):
        out = [self.kdes[0].cdf(q[0])]
        out += [
            self.kdes[idx].cdf(endog_predict=q[idx], exog_predict=q[:idx].T)
            for idx in range(1, len(self))
        ]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out

    def _bnd(self, x):
        return np.min(self.data, 0), np.max(self.data, 0)

    def __len__(self):
        return self.data.shape[-1]
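(For context, each mode's KDE below is built roughly like this; mode1_data and mode2_data are just placeholder names for the subsets of points assigned to each mode:)
mode1_kde = KDE(mode1_data, bw)
mode2_kde = KDE(mode2_data, bw)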
Splitting a data set with 224267 points into two modes (since one of the variables, y, is obviously bimodal), and scaling/transforming the data to be nearly Gaussian before putting it into the KDE estimator, yields the following outputs:
bw = ['normal_reference', 'normal_reference', 'normal_reference']
Mode 1 fraction: 0.666
Time for mode 1: 0.015s
mode1_kde.kdes[0].bw=[0.07792364]
mode1_kde.kdes[1].bw=[0.10193505 0.11591769]
mode1_kde.kdes[2].bw=[0.14645354 0.1539395 0.13537046]
Mode 2 fraction: 0.365
Time for mode 2: 0.009s
mode2_kde.kdes[0].bw= [0.07216334]
mode2_kde.kdes[1].bw=[0.11197565 0.10521291]
mode2_kde.kdes[2].bw =[0.16749934 0.13773206 0.14658503]
Or keeping the data all together and unscaled/transformed:
bw = ['normal_reference', 'normal_reference', 'normal_reference']
Time: 0.026s
kde.kdes[0].bw=[7.59263522]
kde.kdes[1].bw=[ 0.420837 11.44857361]
kde.kdes[2].bw=[1.42321622e-02 1.53514908e+01 5.64303957e-01]
I then used Latin Hypercube sampling to check that the KDEs were good.... but note the evaluation times.
Sampling time for two-mode, 1000 samples, L method: 6418.209s
Sampling time for all data, 1000 samples, L method: 15198.942s
Plots of the marginal PDFs for the data (blue) and from the samples (orange, sampled in correct fractions for the two-mode case).
[marginal PDF figures for x, y, and z]
Here I show the samples from each of the two mode KDEs for variable z:
For comparison, fitting the two-mode data with a multivariate KDE (using scikit-learn in this case) and drawing 1000 random samples yields:
[marginal PDF figures for x, y, and z]
The results are roughly correct, but clearly not accurate enough. Your results tell me the Rosenblatt transformation works in principle, but the accuracy of the approximation is too low. Let us see if we cannot get it up a bit.
I have three ideas of what we can do here:
1. Increase the accuracy of the approximation by adding a probability density function through the ._pdf method.
2. Increase the accuracy by tweaking the convergence tolerance of the approximation method.
3. Best of all is to provide a point percentile function ._ppf to our KDE class, such that the inverse transformation is done without approximation. Neither KDEMultivariate nor KDEMultivariateConditional supports this, but KDEUnivariate does through the .icdf method. If we somehow could figure out how to construct a conditional univariate KDE distribution using only KDEUnivariate, we should be golden.
Let us start with (1) and (2). If that fails, we can start assessing if (3) is viable.
Here is some updated code:
class KDE(cp.Dist):

    def __init__(self, data):
        self.data = data
        self.kdes = [
            sm.nonparametric.KDEMultivariate(
                data=data[:, :1], var_type="c", bw="cv_ml")
        ] + [
            sm.nonparametric.KDEMultivariateConditional(
                endog=[data[:, idx]], exog=list(data[:, :idx].T),
                dep_type='c', indep_type='c'*idx, bw='cv_ml')
            for idx in range(1, data.shape[-1])
        ]
        cp.Dist.__init__(self)

    def _cdf(self, q):
        out = [self.kdes[0].cdf(q[0])]
        out += [
            self.kdes[idx].cdf(endog_predict=q[idx], exog_predict=q[:idx].T)
            for idx in range(1, len(self))
        ]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out

    def _pdf(self, q):
        out = [self.kdes[0].pdf(q[0])]
        out += [
            self.kdes[idx].pdf(endog_predict=q[idx], exog_predict=q[:idx].T)
            for idx in range(1, len(self))
        ]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out

    def _bnd(self, x):
        return np.min(self.data, 0), np.max(self.data, 0)

    def __len__(self):
        return self.data.shape[-1]
samples = cp.Iid(cp.Uniform(), 3).sample(7)
print(cp.approximation.approximate_inverse(cp_kde, samples, iterations=100, tol=1e-5))
print(cp_kde.pdf(data[:4]))
The updated class adds a probability density function to the model.
The snippet at the bottom shows you how to invoke the approximate inverse Rosenblatt transformation directly. (It gets invoked automatically when _ppf is not provided.) The tol parameter sets how accurate the estimation needs to be. Please experiment with lowering it to see if adjusting the tolerance has any effect.
Note that if a warning about the number of iterations being exceeded is printed to screen, increase iterations.
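For example, a tighter tolerance with a larger iteration budget could look like this (the numbers are only a starting point):
print(cp.approximation.approximate_inverse(cp_kde, samples, iterations=1000, tol=1e-8))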
For comparison, do you mind making a 1D KDE as well for the x-axis using the samples data[:, :1], and using the cp_kde.pdf method to plot the KDE? I want to ensure the problem is indeed located in the approximate inverse transformation.
Wow, thank you so much for all of your help! I honestly didn't expect any reply at all when I first opened this as an issue, but I can't tell you how grateful I am!
I am still working on playing with (1) and (2), but in the mean time, here is the 1D KDE plot you asked for:
import time
import matplotlib.pyplot as plt
import seaborn as sns

features = ['x']
bw = ['normal_reference']
data = properties_data[features]
print(len(data))
print(bw)

t0 = time.time()
kde_x = KDE(data.values, bw)
t1 = time.time()
print('Time: %.3f' % (t1 - t0))
print(kde_x.kdes)

plt.figure()
grid = np.arange(0, 25, 0.5)
plt.plot(grid, kde_x.pdf(grid), 'o')
sns.distplot(data)
224267
['normal_reference']
Time: 0.002
[KDE instance
Number of variables: k_vars = 1
Number of samples: nobs = 224267
Variable types: c
BW selection method: normal_reference
]
As an aside, if I wanted to use Latin Hypercube sampling, for example, would I change samples = cp.Iid(cp.Uniform(), 3).sample(7) to samples = cp.Iid(cp.Uniform(), 3).sample(7, rule="L")?
Thank you again!
You're welcome. As I said, it is an interesting use case that would be nice to include in the code base.
The picture confirms that the problem most likely lies in the approximate inverse function. If you generated that figure without using (1), I doubt that adding (1) will have much of an effect. It looks about identical to the previous x-axis figure.
Yes, rule="L" invokes Latin Hypercube Sampling for your KDE.
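For example, to push 1000 Latin Hypercube samples through the KDE:
samples = cp.Iid(cp.Uniform(), 3).sample(1000, rule="L")
kde_samples = cp.approximation.approximate_inverse(cp_kde, samples, iterations=100, tol=1e-5)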
I missed the part about how horrendously slow it was. I suspect it is the accumulated cost of evaluating KDEMultivariateConditional so many times that is causing this.
Scratch what I said about (1) having no effect: it will likely reduce the computation time noticeably. And if this all works out, I could likely tailor something custom to reduce it even further.
@AnnaCraig, have you had the time to test the new method?
Hello,
I apologize for the delay! The HPC cluster I am using was unavailable late last week, and yesterday I allowed myself to become distracted trying to develop a more quantitative measure of the "goodness" of the sampling.
After I had read a bit more to try to make sure I wasn't just being stupid, I was actually going to ask for your advice on this. Would you recommend trying to implement the Jensen–Shannon divergence test? Or would a simple RMSE error between marginal KDE estimates based on the full data set vs. sampled data set be sufficient?
Thank you again!
Best regards, Anna
I am not too experienced with the Jensen-Shannon divergence in practice, but for most intents and purposes RMSE is usually sufficient in my experience.
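For what it is worth, a minimal sketch of such an RMSE check could look like the following (the grid bounds and the two pdf callables are placeholders for KDEs fitted to the full data and to the drawn samples):
import numpy as np

def marginal_rmse(pdf_full, pdf_sampled, grid):
    # RMSE between two marginal PDF curves evaluated on the same grid;
    # pdf_full / pdf_sampled are callables, e.g. the .pdf methods of KDEs
    # fitted to the full data set and to the drawn samples.
    return np.sqrt(np.mean((pdf_full(grid) - pdf_sampled(grid)) ** 2))

# example usage with hypothetical kde_full / kde_sampled objects:
# print(marginal_rmse(kde_full.pdf, kde_sampled.pdf, np.linspace(0, 25, 200)))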
Hello,
I have finally had the chance to come back to work on this! So the Jensen-Shannon test didn't really do what I wanted... because I have so many samples, it was picking up on very minute changes, such that taking more test samples (and visually getting a better marginal KDE match) yielded a worse result.
So now I have defined an area metric as the absolute difference between CDFs of the data and the samples. It seems to be working as a metric, but as I said, there are definitely issues with the z variable not being well-sampled.
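Per variable, the metric amounts to something along these lines (a simplified sketch using empirical CDFs on a common grid, not my exact code):
import numpy as np

def area_metric(data_1d, samples_1d, num=512):
    # Absolute area between the empirical CDFs of the data and of the samples.
    grid = np.linspace(min(data_1d.min(), samples_1d.min()),
                       max(data_1d.max(), samples_1d.max()), num)
    cdf_data = np.searchsorted(np.sort(data_1d), grid, side="right") / float(len(data_1d))
    cdf_samples = np.searchsorted(np.sort(samples_1d), grid, side="right") / float(len(samples_1d))
    return np.trapz(np.abs(cdf_data - cdf_samples), grid)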
I have the files for a full test case which I can share with you: let me know if you want me to work with this as a branch in the GitHub repository, or send me your email address so I can send you the files.
Thank you again for your help!
Best regards, Anna
Okay. Sad to hear that the z-axis is failing.
Feel free to send me the files. jonathf[at]gmail.com. I definitely would want to have a look.