Crosscat hyperprior grid on variance parameter is broader than it needs to be
Crosscat adopts a uniform hyperprior over the parameters of the Normal-Gamma prior distribution on the Normals (see footnote 1, p. 8). For the "variance" parameter, this is done by constructing a grid from roughly 0 to sum((x-x̄)^2) in [construct_continuous_specific_hyper_grid](https://github.com/probcomp/crosscat/blob/6dadb9b33f7111449d5daf5683a1eac6365431a4/cpp_code/src/utils.cpp#L434). The largest variance that makes sense for a sample {x} is max((x-x̄)^2), though. Since this is a grid of only 31 elements, we're potentially losing a fair bit of precision here, and may be able to tighten up convergence a bit by tightening this bound.
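To make the proposal concrete, here's a rough Python sketch of a 31-element log-spaced grid whose upper bound is max((x-x̄)^2). This is illustrative only, not the C++ in utils.cpp; the function name and the placeholder lower bound are made up, and the right lower bound is discussed below.

```python
import numpy as np

def variance_hyper_grid(x, n_grid=31, lower=None):
    # Upper bound: the largest squared deviation in the sample, max((x - x_bar)^2),
    # instead of the current sum((x - x_bar)^2).
    dev_sq = (np.asarray(x, dtype=float) - np.mean(x)) ** 2
    upper = dev_sq.max()
    if lower is None:
        # Placeholder lower bound for illustration; see the lower-bound discussion below.
        lower = upper / 1e6
    return np.logspace(np.log10(lower), np.log10(upper), num=n_grid)
```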
Sounds reasonable to me. What about the lower bound? 0.01*sum((x-x̄)^2), which we currently use, seems pretty arbitrary to me -- surely there could be clusters with much smaller variance than that, very far away from one another.
You could specify 0 as a lower bound for log_linspace, and you would get a grid starting at the smallest positive normal floating-point number.
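As a minimal Python sketch of that behaviour, assuming the lower endpoint is clamped to the smallest positive normal double before the log-spacing is done (the actual implementation is the C++ log_linspace in utils.cpp):

```python
import sys
import numpy as np

def log_linspace(lo, hi, n):
    # A lower bound of 0 gets clamped to the smallest positive normal double
    # (~2.2e-308) before building the logarithmically spaced grid.
    lo = max(lo, sys.float_info.min)
    return np.exp(np.linspace(np.log(lo), np.log(hi), n))

grid = log_linspace(0.0, 1.0, 31)
print(grid[0], grid[-1])   # ~2.2e-308, 1.0
```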
I suppose the smallest distance between any pair of points would be a better starting point, but that takes O(n^2) time to compute.
I guess you can do it in O(n*log(n)) by sorting, since the closest pair will be adjacent in the sorted list.
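Something like this (the helper name is just for illustration); note that exact duplicates have to be dropped, or the smallest gap is 0 and useless as a grid bound:

```python
import numpy as np

def smallest_gap(x):
    # Sort, then the closest pair of 1-D points must be adjacent: O(n log n).
    xs = np.sort(np.asarray(x, dtype=float))
    gaps = np.diff(xs)
    gaps = gaps[gaps > 0]   # exact duplicates would give a gap of 0
    return gaps.min() if gaps.size else None

print(smallest_gap([3.0, 0.1, 7.5, 0.4, 3.0]))   # ~0.3
```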
Right -- I was inexplicably thinking of >1-dimensional spaces. That would probably be a reasonable thing to do, then.
I think you can probably even do closest points in high-dimensional spaces with a KD-tree.
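For instance, a quick sketch using scipy's cKDTree (scipy is an assumption on my part, not something this code path requires):

```python
import numpy as np
from scipy.spatial import cKDTree

def closest_pair_distance(points):
    # Each point's nearest neighbour at k=1 is itself, so query k=2 and
    # take the distance in the second column.
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=2)
    return dists[:, 1].min()

pts = np.random.rand(1000, 3)
print(closest_pair_distance(pts))
```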
Oh, there's a Wikipedia page about this exact problem.
Golly, my memory of computational geometry has rotted.
However, 0 may nevertheless be a reasonable place to start -- for isolated outlying clusters we don't have a reasonable lower bound on their variance.
Yes, the current lower bound seems likely to cause a problem.