KDEpy icon indicating copy to clipboard operation
KDEpy copied to clipboard

Bandwidth selection for 2d data

Open JHWu92 opened this issue 3 years ago • 5 comments

Hi,

How can I apply automatic bandwidth selection for 2d data? I tried 'silverman', 'scott', 'ISJ' and they all say "rule is only available for 1D data".

I also think about cross-validation but I don't know what and how to compute some error function for grid search. Any suggestion?

Best, Jeff

JHWu92 avatar Nov 29 '20 20:11 JHWu92

Hi Jeff,

This is not implemented, since it's not obvious (to me) how to generalize bandwidth to several dimensions for all kernels in an elegant way. In higher dimensions, the bandwidth is a matrix and not a number. That being said, you could scale the data and use bw=1, and it would be equivalent. Here is how it would work in 1D first:

from KDEpy.bw_selection import silvermans_rule
import KDEpy
import numpy as np

np.random.seed(1)

# 1D example
data = np.random.randn(100, 1)

# Compute the BW, scale, then scale back
bw = silvermans_rule(data[:, [0]])
data_scaled = data / np.array([bw])
x_scaled, y_scaled = KDEpy.FFTKDE(bw=1).fit(data_scaled).evaluate()
x = x_scaled * np.array([bw])
y = y_scaled / (bw)

# Use Silverman directly without scaling
x2, y2 = KDEpy.FFTKDE(bw="silverman").fit(data).evaluate()
plt.plot(x, y, label="scaling_method"); plt.plot(x2, y2, label="direct method");plt.legend();

image

If you are comfortable with a diagonal bandwidth matrix (scaling along each axis independently), you could generalize this to 2D like so:

from KDEpy.bw_selection import silvermans_rule
import KDEpy
import numpy as np

np.random.seed(1)

# 2D example
data = np.random.randn(100, 2)

# Compute the BW, scale, then scale back
bw1 = silvermans_rule(data[:, [0]])
bw2 = silvermans_rule(data[:, [1]])
data_scaled = data / np.array([bw1, bw2])
x_scaled, y_scaled = KDEpy.FFTKDE(bw=1).fit(data_scaled).evaluate((32, 32))
x = x_scaled * np.array([bw1, bw2])
y = y_scaled / (bw1 * bw2)

# Verify that the integral is 1
dx = (x[-1, 0] - x[0, 0] ) / (32 - 1) # 32 grid points by default in each direction
dy = (x[-1, 1] - x[0, 1] ) / (32 - 1)
np.sum(y * dx * dy) # 0.9999712583635577

If you want a full bandwidth matrix with off diagonal entries not equal to zero, you could scale and rotate the input data (using principal components), fit the bw=1 kernel, then scale and rotate back using the inverse transform. See this discussion for more details. Hopefully the above is good enough though.

I appreciate thoughts and questions on this issue. Basically multidimensional kernels either have the bandwidth matrix H equal to either (1) identity * h, (2) diagonal H (as seen above), or (3) non-diagonal H (PCA). Maybe I could implemented it in an elegant way, but I am not sure.

tommyod avatar Nov 30 '20 07:11 tommyod

Hi Tommy,

This is a nice trick! I know it would be hard to generalize to multiple dimensions. I don't even fully understand the mathematics of bandwidth in 1D, let alone higher dimension. I am working on geographical data, where x and y are projected to equal distance coordinate reference system. So I guess a diagonal bandwidth matrix should be fine, most certainly better than me assigning an arbitrary bandwidth.

By the way, improved_sheather_jones is the same as KDEpy.FFTKDE(bw="ISJ"), right? I tried your code with ISJ instead for 1d and the plot looks identical.

Thanks again for this highly efficient KDE package!

Best, Jeff

JHWu92 avatar Nov 30 '20 20:11 JHWu92

Yea, it should do the trick for practical purposes at least. I hope to implement some better support, but I don't want to implement too much magic either. At least tricks like these force the user the think explicitly about what they want. :+1:

Yup, improved_sheather_jones is the same as ISJ :)

tommyod avatar Nov 30 '20 21:11 tommyod