KernelDensity.jl
KernelDensity.jl copied to clipboard
Re-scale data before performing bivariate KDE
If you have data that are highly correlated or have unusual shapes, the current code in BivariateKDE does not give a very good representation of the true distribution because the bandwidths and kernel densities are estimated separately in the x
and y
dimensions. For example, with this data set, the current code applied to this data:
using KernelDensity
using StatsPlots
x = randn(1024)
y = x .+ 0.1 .* randn(1024)
plot(kde((x,y)))
produces
A better approach is to estimate the bandwidth of the KDE as a 2x2 covariance matrix (this is, for example, what SciPy does); but a naive implementation of such a "full rank" bandwidth makes it much harder to employ the FFT-to-a-grid trick used in KernelDensity. An equivalent approach that allows to keep the FFT trick is to first de-correlate the input data by performing a rotation and scale so that is covariance is the identity, then estimate using the current KDE code, then de-rotate onto the original grid. This code performs the manipulations:
function rescaled_bivariate_kde((x, y))
z = hcat(x, y)
mu = mean(z, dims=1)
Sigma = cov(z, dims=1)
L = cholesky(Sigma).L
detL = prod(diag(L))
zr = L \ (z .- mu)'
kr = kde((zr[1,:], zr[2,:]))
k = kde((x,y))
ikr = InterpKDE(kr)
for i in 1:size(k.x, 1)
for j in 1:size(k.y, 1)
zr = L \ (permutedims([k.x[i], k.y[j]]) .- mu)'
p = pdf(ikr, zr[1,1], zr[2,1])
k.density[i,j] = p / detL
end
end
k
end
The result on the above data set is
plot(rescaled_bivariate_kde((x,y)))
which is a much better representation of the actual density. Note: I have checked that the proper Jacobian factors are included:
k = rescaled_bivariate_kde((x,y))
sum(k.density .* step(k.x) .* step(k.y)) # 1.0 to good accuracy
I propose that rescaled_bivariate_kde
replace the existing kde
constructor for bivariate data.