ENH: stats.gaussian_kde: add method that returns marginal distribution

Open h3jia opened this issue 6 years ago • 10 comments

Following the discussion at #9906, I added pdf_marginal and logpdf_marginal to scipy.stats.kde, which are useful for visualizing high-dimensional distributions.

h3jia avatar Mar 10 '19 20:03 h3jia

I just rewrote evaluate, logpdf, pdf_marginal, and logpdf_marginal. All of them now use np.einsum instead of looping over the data or points. I don't think it's necessary for evaluate to do whitening. evaluate and logpdf still use self.inv_cov because it's available, while pdf_marginal and logpdf_marginal do not explicitly invert the covariance.
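For readers following along, here is a minimal standalone sketch of what an einsum-based evaluation can look like; this is illustrative only, not the PR's actual code, and `kde_pdf_einsum` is a hypothetical helper that assumes uniform weights:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_pdf_einsum(dataset, inv_cov, points):
    """Evaluate a Gaussian KDE at `points` without a Python loop.

    dataset: (d, n) training data; inv_cov: (d, d) inverse kernel covariance;
    points: (d, m) evaluation points. Returns an (m,) array of densities.
    """
    d, n = dataset.shape
    # Pairwise differences, shape (d, n, m) -- the large intermediate array.
    diff = dataset[:, :, np.newaxis] - points[:, np.newaxis, :]
    # Quadratic form diff.T @ inv_cov @ diff for every (data, point) pair.
    energy = np.einsum('ijk,il,ljk->jk', diff, inv_cov, diff) / 2.0
    # det(cov) = 1 / det(inv_cov), so the Gaussian normalization is:
    norm = np.sqrt((2 * np.pi) ** d / np.linalg.det(inv_cov))
    return np.exp(-energy).sum(axis=0) / (n * norm)

# Sanity check against scipy's implementation (uniform weights):
rng = np.random.default_rng(0)
data = rng.standard_normal((2, 500))
kde = gaussian_kde(data)
pts = rng.standard_normal((2, 10))
assert np.allclose(kde(pts), kde_pdf_einsum(data, kde.inv_cov, pts))
```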

The einsum approach needs more memory, so it fails the 32-bit tests. I'll fix this later, probably by returning to the loop approach.

h3jia avatar Mar 11 '19 08:03 h3jia

@rgommers @ilayn

Now the problem is that, without techniques like tree-based computation, KDE can be expensive for large data sets.

Previously it looped over the data points or input points. That needs less memory, but I expect it to be slow.

I just tried to use einsum or tensordot to vectorize it, but then it couldn't pass the tests because of the memory limit: for each input point we need to evaluate its distance to each data point, so we have to manipulate a (# of dims, # of data, # of inputs) array.

Could you tell me your opinions on this?

I think an intermediate but less clean way would be to divide the data or input points into smaller blocks when they are too large, and do the einsum on each block, as in the sketch below. This amounts to trading run time for memory.
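To make the blocking idea concrete, here is a hedged sketch building on the einsum evaluation above (hypothetical code; the `chunk_size` default is arbitrary and would need tuning):

```python
import numpy as np

def kde_pdf_chunked(dataset, inv_cov, points, chunk_size=10_000):
    """Evaluate the KDE in blocks of input points to cap peak memory.

    The (d, n, block) intermediate never holds more than
    d * n * chunk_size elements, at the cost of a Python-level loop
    over the blocks.
    """
    d, n = dataset.shape
    m = points.shape[1]
    norm = n * np.sqrt((2 * np.pi) ** d / np.linalg.det(inv_cov))
    out = np.empty(m)
    for start in range(0, m, chunk_size):
        block = points[:, start:start + chunk_size]
        diff = dataset[:, :, np.newaxis] - block[:, np.newaxis, :]
        energy = np.einsum('ijk,il,ljk->jk', diff, inv_cov, diff) / 2.0
        out[start:start + block.shape[1]] = np.exp(-energy).sum(axis=0) / norm
    return out
```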

h3jia avatar Mar 11 '19 20:03 h3jia

I was interested in this same behavior, but also numerous other features (e.g. different kernels, reflecting boundary conditions, some custom behavior in resampling... etc). I decided it really warranted a separate package, instead of trying to add all of this functionality to the existing scipy submodule. For anyone interested, that new package (highly beta) can be found here: kalepy.

That being said, I don't know what the optimal tradeoff is for putting large/complex submodules into the scipy package. Should the gaussian_kde functionality be kept as simple, general KDE functionality, leaving complex behavior for another package? Or should this scipy method be expanded (arbitrarily) to include all desired behavior? If anyone has suggestions/comments (e.g. @rgommers), I'm very curious what the consensus thoughts are. I'm also very happy to help incorporate some or all of the additional functionality into the scipy module, or (if the consensus is 'all') perhaps fold this new package into scipy.

@HerculesJack one technical comment: I ran into the same memory vs speed issue you describe above in the resample functionality, and found that implementing a 'chunking' procedure works extremely well (~10--100x speedup, with very small memory footprint). Resampling in chunks of ~1e5 optimized speed and memory requirements for dimensions (numbers of parameters) between 1 and 5. I did not have any issues in the evaluation functionality.
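For illustration only, here is a minimal sketch of drawing KDE samples in chunks. This is the generic Gaussian-KDE resampling algorithm with blocking added, not kalepy's actual implementation, and the names are hypothetical:

```python
import numpy as np

def kde_resample_chunked(dataset, covariance, size, chunk_size=100_000, seed=None):
    """Draw `size` KDE samples in blocks of at most `chunk_size`.

    Each sample is a randomly chosen data point plus zero-mean Gaussian
    noise with the kernel covariance, generated block by block so that
    no intermediate array scales with `size`.
    """
    rng = np.random.default_rng(seed)
    d, n = dataset.shape
    out = np.empty((d, size))
    for start in range(0, size, chunk_size):
        k = min(chunk_size, size - start)
        idx = rng.integers(0, n, size=k)   # which kernel each draw comes from
        noise = rng.multivariate_normal(np.zeros(d), covariance, size=k).T
        out[:, start:start + k] = dataset[:, idx] + noise
    return out
```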

lzkelley avatar Jun 24 '19 15:06 lzkelley

@HerculesJack thought I'd check in on this PR. Would you like to finish it up? What do you need from us?

mdhaber avatar Jul 18 '22 05:07 mdhaber

@mdhaber This is quite an old thread. It didn't get merged because my naive implementation couldn't pass the tests due to memory and/or run-time limits. I no longer need this feature for my own application, but I'm happy to help if you would like to take it over. I think you'd need to first review the code and resolve the conflicts, and then figure out a way to split the computation into batches so that it won't hit the memory cap.

h3jia avatar Jul 18 '22 06:07 h3jia

Thanks for the summary @HerculesJack. In that case, I'll probably record the feature idea somewhere and close the PR.

@lzkelley I've thought about the idea of leaving KDE enhancements to another package, too. As a stats maintainer, I'm conflicted. On one hand, it's very interesting; e.g. adding back support for degenerate data and eliminating the use of inv has gripped me. On the other, my attention is pulled in a lot of directions, so more complicated PRs tend to get proportionally less attention. In any case, I think KDE functionality would be able to grow a lot faster in a separate package; the downside is just that users would not already have it at their fingertips (without looking for it).

I may create a meta-issue summarizing KDE's several enhancement requests and inactive enhancement PRs. If so I'll let you know.

Looks like kalepy's last release was a while ago, but you created some issues a few months ago. How's the project coming? Are there any features in SciPy's KDE that it does not support?

mdhaber avatar Jul 18 '22 13:07 mdhaber

@mdhaber

> Looks like kalepy's last release was a while ago, but you created some issues a few months ago. How's the project coming? Are there any features in SciPy's KDE that it does not support?

kalepy is quite functional, I would say. Unfortunately I haven't had much time to keep developing/improving it, but I do frequently use it in a number of different projects. Currently, its functionality is a superset of scipy's, but there are definitely lots more features that could be implemented. The most pressing improvement would be additional optimization. There's a pending PR for multithreading, for example, that needs to be looked at (I'm more familiar with MPI parallelization, so I would need to learn more about multithreading per se).

lzkelley avatar Jul 18 '22 13:07 lzkelley

Thanks again @HerculesJack. One last question: this PR adds methods for computing the pdf/logpdf of a marginal distribution of a KDE distribution. I'm thinking that rather than having separate methods for pdf/logpdf only, it would make sense to have a single method, e.g. marginalized, that simply returns a KDE object representing the marginalized distribution. Because the marginal PDF of a multivariate normal distribution is the PDF of a multivariate normal distribution using a subset of indices of the original mean and covariance matrix, the implementation is quite simple at its core: just extract the relevant indices of the dataset and covariance matrix and instantiate a new KDE instance with that information. The pdf/logpdf code would not need to be copied or rewritten. Does this make sense?
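A minimal sketch of that idea, as a hypothetical free function rather than the PR's actual code: fixing bw_method to the original factor makes the sliced KDE's kernel covariance exactly the corresponding sub-block of the original covariance.

```python
import numpy as np
from scipy.stats import gaussian_kde

def marginal_kde(kde, keep):
    """Return a gaussian_kde over the variables at indices `keep`.

    Because the marginal of a Gaussian mixture is the same mixture over a
    subset of the coordinates, slicing the dataset and refitting with the
    original bandwidth factor reproduces the marginal distribution exactly.
    """
    keep = np.atleast_1d(keep)
    return gaussian_kde(kde.dataset[keep], bw_method=kde.factor,
                        weights=kde.weights)
```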

mdhaber avatar Aug 10 '22 15:08 mdhaber

> Thanks again @HerculesJack. One last question: this PR adds methods for computing the pdf/logpdf of a marginal distribution of a KDE distribution. I'm thinking that rather than having separate methods for pdf/logpdf only, it would make sense to have a single method, e.g. marginalized, that simply returns a KDE object representing the marginalized distribution. Because the marginal PDF of a multivariate normal distribution is the PDF of a multivariate normal distribution using a subset of indices of the original mean and covariance matrix, the implementation is quite simple at its core: just extract the relevant indices of the dataset and covariance matrix and instantiate a new KDE instance with that information. The pdf/logpdf code would not need to be copied or rewritten. Does this make sense?

Yes, that makes sense to me!

h3jia avatar Aug 10 '22 15:08 h3jia

OK, I resolved the merge conflicts and added a commit (https://github.com/scipy/scipy/pull/9932/commits/0157c13325d54e5935b2c122c13fa53e037b8497) that shows the concept of the new approach. The tests (including one showing that the old and new approaches produce the same PDF values) passed, so I'll remove the old approach.

I'm going to send an email about this to the mailing list with three questions.

  1. What should the method that returns the marginal distribution be called? marginalize is a verb (good), but may not be acceptable due to other meanings of the word. marginal, get_marginal, or get_marginal_distribution would also work. get_marginal_distribution seems clearest, and the method names of gaussian_kde are not super concise to begin with (e.g. integrate_box_1d, integrate_gaussian), so that's my suggestion.
  2. In the original PR, the axis argument specified the dimensions that get "marginalized out" (i.e. those one would integrate over to obtain the marginal distribution). I have already changed this to the opposite: the dimensions corresponding with the "marginal variables", which are kept. This leads to two questions:
     a. Should the user specify the variables that get "marginalized out" or the "marginal variables" that are to be kept? Update: I think the user should specify the indices of the ones to be kept, the marginal variables.
     b. Either way, the name axis is not really appropriate, because throughout NumPy and SciPy it refers to the axis/axes of an array, whereas here we are using the argument as indices along the zeroth axis of the dataset array. (Improving efficiency, e.g. by avoiding re-calculation of the covariance matrix and/or its inverse, would be a separate PR.) What should the name of this parameter be instead? I'd suggest marginal_variables (see the usage sketch after this list).
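To illustrate the proposed semantics, here is a short usage sketch; the method and parameter names are the hypothetical ones suggested above, not a merged API:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
kde = gaussian_kde(rng.standard_normal((3, 1000)))   # three variables

# Hypothetical call: `marginal_variables` lists the indices to KEEP,
# so variable 1 is the one marginalized out (integrated over).
marg = kde.get_marginal_distribution(marginal_variables=[0, 2])
marg.pdf(rng.standard_normal((2, 10)))               # a 2-D gaussian_kde
```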

This excerpt from Wikipedia might help those not already familiar with the terminology (like me): [image: Wikipedia excerpt defining marginal distribution]

mdhaber avatar Aug 10 '22 22:08 mdhaber

Email sent to the mailing list on 8/11. No response so far, but we have a contributor and two stats maintainers who think this is a good idea. Yes, the implementation is simple, but I think it is very useful to users who don't happen to know that the marginal distribution of a multivariate normal takes such a simple form.

@tupui I'm happy with this if tests pass and the latest changes look good to you.

mdhaber avatar Aug 18 '22 01:08 mdhaber