KDEpy Variable bandwidth for 3 dimensional data.

Open ytarricq opened this issue 4 years ago • 6 comments

Hello,

First of all, thanks for the great package. I'm trying to compute density maps of a 3-dimensional point distribution. I understood from the documentation that a variable bandwidth method is available, but I couldn't figure out how to set up this option. Additionally, in the case of a fixed-bandwidth KDE for multidimensional data, I would have expected to be able to use one bandwidth per dimension, as in the stats_models_multivariateKDE implementation, but it seems that we can use either a single bandwidth value or one bandwidth per data point. Did you implement it this way in order to take into account the weight of each data point?

Thanks in advance.

Cheers Yoann

Jul 19 '19 13:07 ytarricq

Thanks for the kind words, and for raising this issue @ytarricq .

Variable bandwidth (i.e. a unique bandwidth per data point) is only available in the NaiveKDE and TreeKDE implementations. You have to supply an array as the bw parameter, see the docs here.
If you want to use a bandwidth matrix that depends on the dimensions, e.g. bandwidth 2 in the x direction and bandwidth 3 in the y direction, that's not supported directly. The reason is that every kernel is implemented as a radial basis function. Scipy supports arbitrary bandwidth matrices, which is easy for Gaussian kernels. KDEpy supports arbitrary kernels, which makes this tougher. There's an elegant workaround to this problem: use the SVD to transform the data (scale, rotate) instead, see my recipe here.
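
To make the first point concrete, here is a minimal sketch of what "a unique bandwidth per data point" means for an isotropic Gaussian kernel. It is written in plain Python for illustration only and is not KDEpy's implementation; the function name `variable_bw_gaussian_kde` and the sample values are made up for this example. In KDEpy itself you would pass the array as the bw parameter of NaiveKDE or TreeKDE, as described above.

```python
import math


def variable_bw_gaussian_kde(data, bandwidths, x):
    """Density at point x from isotropic Gaussian kernels, one bandwidth per data point.

    Hypothetical helper for illustration; not part of KDEpy.
    """
    dims = len(x)
    total = 0.0
    for point, h in zip(data, bandwidths):
        # Squared Euclidean distance: the kernel is a radial basis function
        sq_dist = sum((xj - pj) ** 2 for xj, pj in zip(x, point))
        # Normalising constant of a d-dimensional isotropic Gaussian with std h
        norm = (2.0 * math.pi * h * h) ** (-dims / 2.0)
        total += norm * math.exp(-sq_dist / (2.0 * h * h))
    # Every data point carries equal weight
    return total / len(data)


# Three 3D points; the isolated point is given a wider kernel (values are arbitrary)
data = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (3.0, 3.0, 3.0)]
bandwidths = [0.5, 0.5, 2.0]
density_at_origin = variable_bw_gaussian_kde(data, bandwidths, (0.0, 0.0, 0.0))
```

Note that each bandwidth here is still a single scalar applied in all directions, which is why this does not give you a different bandwidth per dimension.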

Hope this helps you. :+1:

Making that recipe idiot-proof and implementing it in the main library would be a good task. If you (or anyone reading this) is up for it, that's a PR I would merge.

Jul 19 '19 15:07 tommyod

Thanks for the quick answer! That made the distinction between bandwidth matrices and variable bandwidths clearer. I will work on the best way to handle my data and will get back to you if I'm successful.

Jul 22 '19 09:07 ytarricq

I'm having the same issue with the fixed bandwidth for all dimensions. In my case one dimension has a radically different scale than the others, and hence the resulting KDEs don't look good. Just scaling the data in that dimension (and rescaling after the KDE) works fine, since I don't need any rotation/covariance. Wouldn't implementing a bw per dimension, if given as an iterable, get us a long way without complicating things too much?

Mar 24 '20 10:03 philippeller

@philippeller : Since the kernel functions are radial basis functions, I suppose your suggestion would amount to scaling the input data in each dimension, computing the KDE, then scaling back. However, it would hide how the data is scaled from the user. Some options are: min/max scaling, standardizing with the standard deviation and the mean, quantile transformations, etc.

I feel that "simple is better than complex" and the "principle of least surprise" apply here. Doing some implicit scaling scheme might confuse users more than it helps them. Stating that "the multidimensional KDE is isotropic" and letting users handle scaling seems simpler to understand and less likely to produce unexpected results.

I'm open to suggestions of course. But I would need some details. A high-level wrapper function, or a ScalingTransformer class might be sensible.

Mar 24 '20 11:03 tommyod

Maybe I'm missing some important point, but I was thinking not an implicit, but an explicit scaling.

Let's say the user supplies two-dimensional data (x and y) and the fixed bandwidths as (bw_x, bw_y). Internally you compute scaled_x = x / bw_x (and scaled_y = y / bw_y), then proceed with the KDE on the scaled data using bandwidth = 1, and in the end just undo the scaling. Wouldn't that work?
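
The steps above can be sketched roughly as follows. This is plain Python with a hand-written isotropic Gaussian KDE standing in for the library call, so the names `isotropic_gaussian_kde` and `per_dimension_bw_kde` are hypothetical and nothing here is KDEpy's API. One assumption worth stating: "undoing the scaling" is taken to mean dividing the density by the product of the bandwidths, so that it still integrates to 1 in the original coordinates.

```python
import math


def isotropic_gaussian_kde(data, x, bw=1.0):
    """Isotropic Gaussian KDE at point x (stand-in for a real KDE implementation)."""
    dims = len(x)
    norm = (2.0 * math.pi * bw * bw) ** (-dims / 2.0)
    total = 0.0
    for point in data:
        sq_dist = sum((xj - pj) ** 2 for xj, pj in zip(x, point))
        total += norm * math.exp(-sq_dist / (2.0 * bw * bw))
    return total / len(data)


def per_dimension_bw_kde(data, x, bws):
    """Density at x with one bandwidth per dimension, obtained by explicit scaling."""
    # Step 1: divide every coordinate by the bandwidth of its dimension
    scaled_data = [tuple(pj / h for pj, h in zip(point, bws)) for point in data]
    scaled_x = tuple(xj / h for xj, h in zip(x, bws))
    # Step 2: ordinary isotropic KDE on the scaled data with bandwidth 1
    density_scaled = isotropic_gaussian_kde(scaled_data, scaled_x, bw=1.0)
    # Step 3: undo the scaling (change of variables) so the density is
    # expressed in the original coordinates
    return density_scaled / math.prod(bws)


# The second dimension has a much larger scale than the first (values are arbitrary)
data = [(0.0, 0.0), (1.0, 100.0), (0.5, 40.0)]
density = per_dimension_bw_kde(data, x=(0.2, 30.0), bws=(0.5, 50.0))
```

For a Gaussian kernel this gives the same result as a product kernel with standard deviation bw_x in x and bw_y in y; evaluation points have to be scaled the same way as the data.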

Mar 24 '20 11:03 philippeller

That would work. :+1: Thanks for clarifying. I got a little ahead of myself.

What you're sketching might be worth implementing. In a different issue #6 we had some discussions about a more general case. It's really an issue of API design. The way I see it:

Pros:

Useful in some use cases, saves the users some time (but they can do it themselves too)

Cons:

Extends the current API a little (but backwards compatible, so no big deal)
Doesn't implement the more general case (general anisotropic KDEs via rotations)

In conclusion I would merge a PR that implements this. :+1: No promises about when/if I'll find time to do it myself though.

Mar 24 '20 12:03 tommyod
