
Multivariate Kernel Density Estimation

Open AnnaCraig opened this issue 6 years ago • 19 comments

Hello,

I am trying to use Chaospy to perform advanced sampling of a multivariate KDE generated via sm.nonparametric.KDEMultivariate. Unfortunately, I am not able to define the KDE as a custom distribution in such a way that I can then call the sample operation. Could someone help me understand how I might be able to accomplish this, if it can be done at all?

Any help would be deeply appreciated!

import numpy as np
import statsmodels.api as sm
import chaospy as cp

data = np.array([[ 0.64206175,  0.49193947, -1.47253426],
       [-0.94203536, -1.16952506,  0.99483489],
       [-0.53587827, -0.29164457,  0.6460487 ],
       [ 0.43096139, -0.59982553, -0.01857683],
       [ 0.52384554, -0.54059551,  0.11118488],
       [-0.65789427,  0.15139668,  0.09203062],
       [ 0.2663031 , -0.4434035 , -0.16632009],
       [ 0.08475679, -0.0661517 , -0.34676888],
       [-0.79848711, -1.63556875,  0.52249545],
       [-1.18427302, -1.33508375,  1.12194074],
       [ 0.18819597, -0.84600592, -0.03214548]])

kde = sm.nonparametric.KDEMultivariate(data=data, var_type='ccc', bw='cv_ml')

cp_kde = cp.construct(cdf=lambda self, q: kde.cdf(q),
                     bnd=lambda self: [(-1.82392181, 9.81120074),
                                         (-8.63429993, 3.11334478),
                                         ( -4.74632693, 2.08811487)])()

cp_kde.sample(4, 'L')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-9984df9b1b05> in <module>()
     22                                          ( -4.74632693, 2.08811487)])()
     23 
---> 24 cp_kde.sample(4, 'L')

/home/<>/lib/python2.7/site-packages/chaospy/distributions/baseclass.pyc in sample(self, size, rule, antithetic, verbose, **kws)
    287         from . import sampler
    288         out = sampler.generator.generate_samples(
--> 289             order=size_, domain=self, rule=rule, antithetic=antithetic)
    290         try:
    291             out = out.reshape(shape)

/home/<>/lib/python2.7/site-packages/chaospy/distributions/sampler/generator.pyc in generate_samples(order, domain, rule, antithetic)
    137     assert rule in SAMPLERS, "rule not recognised"
    138     sampler = SAMPLERS[rule]
--> 139     x_data = trans(sampler(order=order, dim=dim))
    140 
    141     logger.debug("order: %d, dim: %d -> shape: %s", order, dim, x_data.shape)

/home/<>/lib/python2.7/site-packages/chaospy/distributions/baseclass.pyc in inv(self, q, maxiter, tol, verbose, **kws)
    207         """
    208         from . import rosenblatt
--> 209         return rosenblatt.inv(self, q, maxiter, tol, **kws)
    210 
    211     def pdf(self, x, step=1e-7, verbose=0):

/home/<>/lib/python2.7/site-packages/chaospy/distributions/rosenblatt.pyc in inv(dist, q, maxiter, tol, **kws)
     36 
     37     try:
---> 38         out, graph = dist.graph.run(q, "inv", maxiter=maxiter, tol=tol)
     39 
     40     except NotImplementedError:

/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/baseclass.pyc in run(self, x, mode, **meta)
    196     def run(self, x, mode, **meta):
    197         """Run through network to perform an operator."""
--> 198         return main.call(self, x, mode, **meta)
    199 
    200     def counting(self, dist, mode):

/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/main.pyc in call(self, x_data, mode, **meta)
     59         out = self(x_data)
     60     else:
---> 61         out = self(x_data, self.root)
     62 
     63     # if mode == "ttr":

/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/baseclass.pyc in __call__(self, *args, **kwargs)
    158 
    159     def __call__(self, *args, **kwargs):
--> 160         return self._call(self, *args, **kwargs)
    161 
    162     def __str__(self):

/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/calling/inv.pyc in inv_call(self, q_data, dist)
     33     else:
     34         out, _, _ = approx.ppf(
---> 35             dist, q_data, self, retall=1, **self.meta)
     36     graph.add_node(dist, key=out)
     37 

/home/<>/lib/python2.7/site-packages/chaospy/distributions/approx.pyc in ppf(dist, q, G, maxiter, tol, retall, verbose)
    249 
    250     if not dist.advance:
--> 251         dist.prm, prm = G.K.build(), dist.prm
    252         out = inv(dist, q, maxiter, tol, retall, verbose)
    253         dist.prm = prm

AttributeError: Graph instance has no attribute 'K'

AnnaCraig avatar Apr 24 '18 15:04 AnnaCraig

Hello.

What you are asking for should be doable, but it is likely not as straightforward as one would like. However, I do see this as an excellent use case for Chaospy, so if we can figure out your use case, we are likely to be able to generalize upon it.

Theory

First of all, the name cdf is a bit misleading. It represents the cumulative distribution function only in the univariate case. In the multivariate case it is the forward Rosenblatt transformation. If you are not familiar with it, in three variables the Rosenblatt transformation R is defined as follows:

p(x, y, z) = p(x) p(y | x) p(z | x, y)
R(x, y, z) = [F(x), F(y | x), F(z | x, y)]

Here p are probability density functions and F are cumulative distribution functions.

For our 3D KDE we then have to construct three KDEs: one univariate, and two conditional ones. I'm not very familiar with statsmodels, but it looks like they support conditional KDE.
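As a concrete illustration of the definition (unrelated to KDEs), applying R to a correlated bivariate Gaussian, where the conditional CDF is known in closed form, maps the samples to approximately independent uniform variables:

import numpy as np
from scipy.stats import norm

rho = 0.7
np.random.seed(0)
x = np.random.randn(1000)
y = rho * x + np.sqrt(1 - rho**2) * np.random.randn(1000)  # correlated with x

# forward Rosenblatt: R(x, y) = [F(x), F(y | x)]
u1 = norm.cdf(x)
u2 = norm.cdf((y - rho * x) / np.sqrt(1 - rho**2))

print(np.corrcoef(u1, u2)[0, 1])  # close to zero: u1 and u2 behave like independent uniforms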

Framework

I am currently working on a larger refactor of the distribution module to simplify and improve many of its parts. Your use case, for example, raises a very uninformative error that could be a lot more descriptive. The core of the new system is in place, but lots of testing and clean-up is still unfinished, so expect bugs.

You will have to do a checkout of the branch dist_clean to be able to run the code below.

Implementation

Here is a proposal:

class KDE(cp.Dist):
    def __init__(self, data):
        cp.Dist.__init__(self)
        self.kdes = [
            sm.nonparametric.KDEMultivariate(
                data=data[:, :1], var_type="c", bw="cv_ml")
        ] + [
            sm.nonparametric.KDEMultivariateConditional(
                endog=[data[:, idx]], exog=list(data[:, :idx].T),
                dep_type='c', indep_type='c'*idx, bw='cv_ml')
            for idx in range(1, data.shape[-1])
        ]
        self.data = data
    def _cdf(self, q):
        out = [self.kdes[0].cdf(q[0])]
        out += [
            self.kdes[idx].cdf(endog_predict=q[idx], exog_predict=q[:idx].T)
            for idx in range(1, len(self))
        ]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out
    def _bnd(self, x):
        return np.min(self.data, 0), np.max(self.data, 0)
    def __len__(self):
        return self.data.shape[-1]

cp_kde = KDE(data)
print(cp_kde.sample(7))

As I noted, I am not too experienced with KDE, so please verify that what I have written makes sense to you. Also, since there is no inverse transformation _ppf, it is instead approximated numerically, so this is by no means a fast method.
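If you want a quick sanity check before sampling, something like the following should work (assuming the branch exposes the usual fwd method for the forward Rosenblatt transformation):

u = cp_kde.fwd(data.T)   # forward transform of the training points, shape (3, 11)
print(u.min(), u.max())  # all values should fall inside the unit cube [0, 1]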

jonathf avatar Apr 24 '18 17:04 jonathf

@flo2k, since you made the 1D KDE distribution, your thoughts on a multivariate variant are welcome.

(Also, congrats on the baby.)

jonathf avatar Apr 24 '18 17:04 jonathf

@jonathf I will try to have a look at the multivariate KDE on Thursday. Maybe we can extend the 1D solution.

flo2k avatar Apr 24 '18 18:04 flo2k

Thank you both for your help and interest! I'm afraid I need to work on something else this afternoon, but I will try to work with the suggested solution on Thursday as well.

I can't tell you how much I appreciate the help!

AnnaCraig avatar Apr 24 '18 18:04 AnnaCraig

Hello,

I am very sorry to bother you about a silly question, but I can't seem to get the code you sent to run:

import chaospy as cp
import statsmodels.api as sm
import numpy as np

class KDE(cp.Dist):
    def __init__(self,data):
        cp.Dist.__init__(self)
        self.kdes = ([sm.nonparametric.KDEMultivariate( data=data[:, :1], var_type="c", bw="cv_ml")]+ 
                     [sm.nonparametric.KDEMultivariateConditional(
                         endog=[data[:, idx]], exog=list(data[:, :idx].T),
                         dep_type='c', indep_type='c'*idx, bw='cv_ml')
                      for idx in range(1, data.shape[-1])
                     ])
        self.data = data
        
    def _cdf(self, q):
        out = [self.kdes[0].cdf(q[0])]
        out += [self.kdes[idx].cdf(endog_predict=q[idx], exog_predict=q[:idx].T)
                for idx in range(1, len(self))]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out
    def _bnd(self, x):
        return np.min(self.data, 0), np.max(self.data, 0)
    
    def __len__(self):
        return self.data.shape[-1]

D = np.array([[ 0.64206175,  0.49193947, -1.47253426],
       [-0.94203536, -1.16952506,  0.99483489],
       [-0.53587827, -0.29164457,  0.6460487 ],
       [ 0.43096139, -0.59982553, -0.01857683],
       [ 0.52384554, -0.54059551,  0.11118488],
       [-0.65789427,  0.15139668,  0.09203062],
       [ 0.2663031 , -0.4434035 , -0.16632009],
       [ 0.08475679, -0.0661517 , -0.34676888],
       [-0.79848711, -1.63556875,  0.52249545],
       [-1.18427302, -1.33508375,  1.12194074],
       [ 0.18819597, -0.84600592, -0.03214548]])

cp_kde = KDE(D)
print(cp_kde.sample(7))
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-55-32fcf2f33ea3> in <module>()
     11        [ 0.18819597, -0.84600592, -0.03214548]])
     12 
---> 13 cp_kde = KDE(D)
     14 print(cp_kde.sample(7))

<ipython-input-54-9090c24341fc> in __init__(self, data)
      5 class KDE(cp.Dist):
      6     def __init__(self,data):
----> 7         cp.Dist.__init__(self)
      8         self.kdes = ([sm.nonparametric.KDEMultivariate( data=data[:, :1], var_type="c", bw="cv_ml")]+ 
      9                      [sm.nonparametric.KDEMultivariateConditional(

/home/<>/lib/python2.7/site-packages/chaospy/distributions/baseclass.pyc in __init__(self, **prm)
    109         self.prm = prm.copy()
    110         self.graph = graph.Graph(self)
--> 111         self.dependencies = self.graph.run(self.length, "dep")[0]
    112 
    113     def range(self, x=None, retall=False, verbose=False):

/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/baseclass.pyc in run(self, x, mode, **meta)
    196     def run(self, x, mode, **meta):
    197         """Run through network to perform an operator."""
--> 198         return main.call(self, x, mode, **meta)
    199 
    200     def counting(self, dist, mode):

/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/main.pyc in call(self, x_data, mode, **meta)
     54 
     55     if mode in ("rnd", "dep"):
---> 56         out = self(self.root)
     57 
     58     elif mode == "val":

/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/baseclass.pyc in __call__(self, *args, **kwargs)
    158 
    159     def __call__(self, *args, **kwargs):
--> 160         return self._call(self, *args, **kwargs)
    161 
    162     def __str__(self):

/home/<>/lib/python2.7/site-packages/chaospy/distributions/graph/calling/dep.pyc in dep_call(self, dist)
     10 
     11     if hasattr(dist, "_dep"):
---> 12         out = dist._dep(self)
     13     else:
     14         for val in self.dist.prm.values():

/home<>/lib/python2.7/site-packages/chaospy/distributions/baseclass.pyc in _dep(self, graph)
    378         """
    379         sets = [graph(dist) for dist in graph.dists]
--> 380         if len(self)==1:
    381             out = [set([self])]
    382         else:

<ipython-input-54-9090c24341fc> in __len__(self)
     25 
     26     def __len__(self):
---> 27         return self.data.shape[-1]

AttributeError: 'KDE' object has no attribute 'data'


AnnaCraig avatar Apr 26 '18 12:04 AnnaCraig

Hello @AnnaCraig, I'm sorry, but I'm too busy today with other things. I hope I can have a look at your code by the end of the week.

Best,

Florian

flo2k avatar Apr 26 '18 13:04 flo2k

Hello Florian,

I am just hugely grateful that you would be taking the time to help at all: thank you again!

Best regards, Anna

AnnaCraig avatar Apr 26 '18 13:04 AnnaCraig

The error you are getting shows that you are still using the old backend. I should perhaps have been more explicit with my instructions, so let us try again.

You need to install a version from the branch clean_dist. On Mac/Linux you open a terminal and do the following:

git clone https://github.com/jonathf/chaospy
cd chaospy
git checkout clean_dist
# create/enable virtual environment, if that is your thing
python setup.py install

(The last command might need to be run as super user if you are installing system-wide.)

jonathf avatar Apr 26 '18 14:04 jonathf

I knew it was a stupid error on my part! It looks like it is running now, so I will work on validation and let you know how it goes.

Thank you again so very much for your help!

Best regards, Anna

AnnaCraig avatar Apr 26 '18 14:04 AnnaCraig

Hello,

Just to follow up on this, the code seems to be running, but the outputs are not as good as I would expect. Below are some figures/runtime stats, in case you are interested.

Thank you again for all of your help!

Best regards, Anna

class KDE(cp.Dist):
    def __init__(self,data, bw):
        cp.Dist.__init__(self)
        self.kdes = ([sm.nonparametric.KDEMultivariate( data=data[:, :1], var_type="c", bw=bw[0])]+ 
                     [sm.nonparametric.KDEMultivariateConditional(
                         endog=[data[:, idx]], exog=list(data[:, :idx].T),
                         dep_type='c', indep_type='c'*idx, bw=bw[idx])
                      for idx in range(1, data.shape[-1])
                     ])
        self.data = data
        
    def _cdf(self, q):
        out = [self.kdes[0].cdf(q[0])]
        out += [self.kdes[idx].cdf(endog_predict=q[idx], exog_predict=q[:idx].T)
                for idx in range(1, len(self))]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out
    def _bnd(self, x):
        return np.min(self.data, 0), np.max(self.data, 0)
    
    def __len__(self):
        return self.data.shape[-1]

Splitting a data set with 224267 points into two modes (since one of the variables, y, is obviously bimodal) and scaling/transforming the data to be nearly Gaussian before passing it to the KDE estimator yields the following output:

bw = ['normal_reference', 'normal_reference', 'normal_reference']
Mode 1 fraction: 0.666
Time for mode 1: 0.015s
mode1_kde.kdes[0].bw=[0.07792364]
mode1_kde.kdes[1].bw=[0.10193505 0.11591769]
mode1_kde.kdes[2].bw=[0.14645354 0.1539395  0.13537046]
Mode 2 fraction: 0.365
Time for mode 2: 0.009s
mode2_kde.kdes[0].bw= [0.07216334]
mode2_kde.kdes[1].bw=[0.11197565 0.10521291]
mode2_kde.kdes[2].bw =[0.16749934 0.13773206 0.14658503]

Or, keeping all of the data together, unscaled and untransformed:

bw = ['normal_reference', 'normal_reference', 'normal_reference']
Time: 0.026s
kde.kdes[0].bw=[7.59263522]
kde.kdes[1].bw=[ 0.420837   11.44857361]
kde.kdes[2].bw=[1.42321622e-02 1.53514908e+01 5.64303957e-01]

I then used Latin Hypercube sampling to check that the KDEs were good, but note the evaluation times.

Sampling time for two-mode, 1000 samples, L method: 6418.209s
Sampling time for all data, 1000 samples, L method: 15198.942s

Plots of the marginal PDFs for the data (blue) and from the samples (orange, sampled in correct fractions for the two-mode case).

(figures: marginal PDFs for x, y, and z)

Here I show the samples from each of the two-mode KDEs for variable z: (figure: z_split)

For comparison, fitting the two-mode data with a multivariate KDE (using scikit-learn in this case) and drawing 1000 random samples yields:

(figures: marginal plots for x, y, and z)

AnnaCraig avatar May 01 '18 11:05 AnnaCraig

The results are roughly correct, but clearly not accurate enough. Your results tell me the Rosenblatt transformation works in principle, but the accuracy of the approximation is too low. Let us see if we can get it up a bit.

I have three ideas of what we can do here:

  1. Increase the accuracy of the approximation by adding a probability density function through the ._pdf method.
  2. Increase the accuracy by tweaking the convergence tolerance of the approximation method.
  3. Best of all is to provide a point percentile function ._ppf to our KDE class such that the inverse transformation is done without approximation. Neither KDEMultivariate nor KDEMultivariateConditional supports this, but KDEUnivariate does through the .icdf method. If we could somehow figure out how to construct a conditional univariate KDE distribution using only KDEUnivariate, we should be golden (a rough univariate sketch follows below).

Let us start with (1) and (2). If that fails, we can start assessing if (3) is viable.
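For reference, here is a rough univariate sketch of what (3) could build on. It is only an illustration (the helper name kde_ppf is made up, and it has not been tested against the branch): fit a KDEUnivariate and invert its CDF numerically by interpolating over the fitted support grid.

import numpy as np
import statsmodels.api as sm

def kde_ppf(kde_fit, q):
    # kde_fit: a fitted sm.nonparametric.KDEUnivariate instance
    # invert the CDF by interpolating over the (cdf, support) grid
    return np.interp(q, kde_fit.cdf, kde_fit.support)

kde_x = sm.nonparametric.KDEUnivariate(data[:, 0])
kde_x.fit()
print(kde_ppf(kde_x, [0.1, 0.5, 0.9]))  # approximate 10/50/90 percentiles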

Here is some updated code:

class KDE(cp.Dist):
    def __init__(self, data):
        self.data = data
        self.kdes = [
            sm.nonparametric.KDEMultivariate(
                data=data[:, :1], var_type="c", bw="cv_ml")
        ] + [
            sm.nonparametric.KDEMultivariateConditional(
                endog=[data[:, idx]], exog=list(data[:, :idx].T),
                dep_type='c', indep_type='c'*idx, bw='cv_ml')
            for idx in range(1, data.shape[-1])
        ]
        cp.Dist.__init__(self)
    def _cdf(self, q):
        out = [self.kdes[0].cdf(q[0])]
        out += [
            self.kdes[idx].cdf(endog_predict=q[idx], exog_predict=q[:idx].T)
            for idx in range(1, len(self))
        ]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out
    def _pdf(self, q):
        out = [self.kdes[0].pdf(q[0])]
        out += [
            self.kdes[idx].pdf(endog_predict=q[idx], exog_predict=q[:idx].T)
            for idx in range(1, len(self))
        ]
        out = np.array(out)
        out = out.reshape(q.shape)
        return out
    def _bnd(self, x):
        return np.min(self.data, 0), np.max(self.data, 0)
    def __len__(self):
        return self.data.shape[-1]

cp_kde = KDE(data)
samples = cp.Iid(cp.Uniform(), 3).sample(7)
print(cp.approximation.approximate_inverse(cp_kde, samples, iterations=100, tol=1e-5))
print(cp_kde.pdf(data[:4].T))

The updated class adds a probability density function to the model.

The snippet at the bottom shows you how to invoke the approximate inverse Rosenblatt transformation directly. (It gets invoked automatically when _ppf is not provided.) The tol parameter sets how accurate the estimation needs to be. Please experiment with lowering it to see if adjusting the tolerance has any effect.
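For example, a small timing experiment along these lines (purely illustrative, reusing the approximate_inverse call above) shows how the runtime changes as the tolerance is tightened:

import time

samples = cp.Iid(cp.Uniform(), 3).sample(100, rule="L")
for tol in (1e-3, 1e-5, 1e-7):
    t0 = time.time()
    out = cp.approximation.approximate_inverse(
        cp_kde, samples, iterations=1000, tol=tol)
    print("tol=%g took %.1fs" % (tol, time.time() - t0))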

Note that if a warning that the number of iterations has been exceeded is printed to screen, increase iterations.

For comparison, do you mind making a 1D KDE as well for the x-axis using the samples data[:, :1], and using the cp_kde.pdf method to plot the KDE? I want to ensure the problem is indeed located in the approximate inverse transformation.

jonathf avatar May 01 '18 12:05 jonathf

Wow, thank you so much for all of your help! I honestly didn't expect any reply at all when I first opened this as an issue, but I can't tell you how grateful I am!

I am still working on playing with (1) and (2), but in the meantime, here is the 1D KDE plot you asked for:

import time
import matplotlib.pyplot as plt
import seaborn as sns

features = ['x']
bw = ['normal_reference']
data = properties_data[features]

print(len(data))
print(bw)

t0 = time.time() 
kde_x = KDE(data.values,bw)
t1 = time.time()

print('Time: %.3f' %(t1-t0))
print(kde_x.kdes)

plt.figure()
plt.plot(np.arange(0, 25,0.5), kde_x.pdf(np.arange(0, 25,0.5)),'o')
sns.distplot(data)
224267
['normal_reference']
Time: 0.002
[KDE instance
Number of variables: k_vars = 1
Number of samples:   nobs = 224267
Variable types:      c
BW selection method: normal_reference
]

(figure: 1D KDE for x plotted against the data distribution)

As an aside, if I wanted to use Latin Hypercube sampling, for example, I would change samples = cp.Iid(cp.Uniform(), 3).sample(7) to samples = cp.Iid(cp.Uniform(), 3).sample(7, rule="L")?

Thank you again!

AnnaCraig avatar May 01 '18 18:05 AnnaCraig

You're welcome. As I said, it is an interesting use case that would be nice to include in the code base.

The picture confirms that the problem most likely lies in the approximate inverse function. If you generated that without using (1), I doubt that adding (1) will have much of an effect. It looks about identical to the previous x-axis figure.

Yes, rule="L" evokes Latin Hypercube Sampling for you KDE.

jonathf avatar May 01 '18 18:05 jonathf

I missed the part about how horrendously slow it was. I suspect the accumulated cost of evaluating KDEMultivariateConditional too often is what is causing this.

Scratch what I said about (1) having no effect. It will likely noticeably reduce computation time. And if this all works out, I could likely custom-tailor something to reduce it even further.

jonathf avatar May 01 '18 22:05 jonathf

@AnnaCraig, have you had the time to test the new method?

jonathf avatar May 07 '18 22:05 jonathf

Hello,

I apologize for the delay! The HPC cluster I am using was unavailable late last week, and yesterday I allowed myself to become distracted trying to develop a more quantitative measure of the "goodness" of the sampling.

After I had read a bit more to make sure I wasn't just being stupid, I was actually going to ask for your advice on this. Would you recommend trying to implement the Jensen-Shannon divergence test? Or would a simple RMSE between marginal KDE estimates based on the full data set vs. the sampled data set be sufficient?

Thank you again!

Best regards, Anna

AnnaCraig avatar May 08 '18 10:05 AnnaCraig

I am not too experienced with the Jensen-Shannon divergence in practice, but for most intents and purposes, RMSE is usually sufficient in my experience.
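A minimal sketch of such a comparison (full_x and sample_x are placeholders for one variable's values from the full data set and from your drawn samples):

import numpy as np
import statsmodels.api as sm

def marginal_rmse(full_x, sample_x, gridsize=200):
    # fit a 1D KDE to each set and compare the densities on a common grid
    kde_full = sm.nonparametric.KDEUnivariate(full_x)
    kde_full.fit()
    kde_samp = sm.nonparametric.KDEUnivariate(sample_x)
    kde_samp.fit()
    grid = np.linspace(min(full_x.min(), sample_x.min()),
                       max(full_x.max(), sample_x.max()), gridsize)
    d_full = np.interp(grid, kde_full.support, kde_full.density)
    d_samp = np.interp(grid, kde_samp.support, kde_samp.density)
    return np.sqrt(np.mean((d_full - d_samp) ** 2))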

jonathf avatar May 08 '18 12:05 jonathf

Hello,

I have finally had the chance to come back to work on this! The Jensen-Shannon test didn't really do what I wanted: because I have so many samples, it was picking up on very minute changes, such that taking more test samples (and visually getting a better marginal KDE match) yielded a worse score.

So now I have defined an area metric as the absolute difference between CDFs of the data and the samples. It seems to be working as a metric, but as I said, there are definitely issues with the z variable not being well-sampled.
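For concreteness, the metric I mean can be computed along these lines for a single variable (x_data and x_samples are placeholders for one column of the data and of the samples):

import numpy as np

def cdf_area_metric(x_data, x_samples):
    # area between the two empirical CDFs, integrated over the merged support
    grid = np.sort(np.concatenate([x_data, x_samples]))
    ecdf_data = np.searchsorted(np.sort(x_data), grid, side="right") / float(len(x_data))
    ecdf_samp = np.searchsorted(np.sort(x_samples), grid, side="right") / float(len(x_samples))
    return np.trapz(np.abs(ecdf_data - ecdf_samp), grid)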

I have the files for a full test case that I can share with you, if you want me to work on this as a branch in the GitHub repository; alternatively, send me your email address and I can send you the files.

Thank you again for your help!

Best regards, Anna

AnnaCraig avatar May 22 '18 12:05 AnnaCraig

Okay. Sad to hear that the z-axis is failing.

Feel free to send me the files. jonathf[at]gmail.com. I definitely would want to have a look.

jonathf avatar May 22 '18 19:05 jonathf