
Added evidence-based Gibbs sampling to Bayesian Networks

Open NamesAreAPain opened this issue 5 years ago • 12 comments

Although the code could be made more efficient by leveraging numpy, this basic implementation of Gibbs sampling works adequately for plenty of purposes.

NamesAreAPain avatar Nov 11 '19 03:11 NamesAreAPain

Thanks for the contribution. I'm trying to understand the code now. I'll get back to you soon about this PR.

jmschrei avatar Nov 23 '19 23:11 jmschrei

It would also be helpful if you could add some unit tests to ensure that sampling (with a fixed seed) gives the expected behavior.
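
The seeded-reproducibility pattern could look something like this (a toy stand-in sampler, not pomegranate's actual API — the function name and distribution are illustrative only):

```python
import numpy as np

def sample_with_seed(n, seed):
    """Toy stand-in for a network sampler: draws n states from a fixed
    categorical distribution, seeded so runs are reproducible."""
    rng = np.random.RandomState(seed)
    return list(rng.choice(["A", "B", "C"], size=n, p=[0.5, 0.3, 0.2]))

def test_sampling_is_reproducible():
    # Two runs with the same seed must produce identical samples.
    assert sample_with_seed(10, seed=42) == sample_with_seed(10, seed=42)

test_sampling_is_reproducible()
```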

jmschrei avatar Nov 24 '19 00:11 jmschrei

Hi,

Some colleagues and I are really interested in this and are willing to contribute. I tried the pull request and noticed breaking issues that need correcting and testing.

Since it was proposed several months ago, we would like to open a new pull request. Is that fine with everyone?

pascal-schetelat avatar May 05 '20 07:05 pascal-schetelat

That's fine with me. If you could add unit tests and appropriate documentation on your additions as well that would be great.

jmschrei avatar May 07 '20 04:05 jmschrei

Ok, I have a first implementation that uses rejection sampling to take evidence on non-marginal nodes into account.

I ended up using networkx's topological sort to order the nodes. I didn't manage to find one elsewhere in pomegranate; did I miss something?

I'll make a pull request once tests are in place.

As for the API, this is what I have for now; let me know if it departs too much from what you had in mind:

```python
	def sample(self, n=1, evidences=[{}], min_prob=0.01):
		"""Sample the network, optionally given some evidence.

		Rejection is used to condition on non-marginal nodes.

		Parameters
		----------
		n : int, optional
			The number of samples to generate. Defaults to 1.
		evidences : list of dict, optional
			Evidence to hold constant while samples are generated.
		min_prob : float, optional
			Stop iterating when Sum P(X | evidence) < min_prob; the
			samples generated for that evidence will then be
			incomplete (fewer than n).

		Returns
		-------
		A nested list of sampled states.
		"""
```
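
The rejection idea behind this API can be sketched on a toy two-node network `A -> B` (illustrative only, not this fork's implementation; the `max_tries` cutoff plays a role analogous to the `min_prob` stopping criterion above):

```python
import random

# Toy network A -> B with binary states "T"/"F".
P_A = {"T": 0.3, "F": 0.7}
P_B_given_A = {"T": {"T": 0.9, "F": 0.1}, "F": {"T": 0.2, "F": 0.8}}

def draw(dist, rng):
    """Draw one value from a {value: probability} dict."""
    r, acc = rng.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against float rounding

def rejection_sample(n, evidence, rng=None, max_tries=100000):
    """Sample ancestrally, discarding draws inconsistent with evidence,
    e.g. evidence={"B": "T"}. Stops early after max_tries attempts, so
    the result may hold fewer than n samples for unlikely evidence."""
    rng = rng or random.Random(0)
    samples, tries = [], 0
    while len(samples) < n and tries < max_tries:
        tries += 1
        a = draw(P_A, rng)
        b = draw(P_B_given_A[a], rng)
        state = {"A": a, "B": b}
        if all(state[k] == v for k, v in evidence.items()):
            samples.append(state)
    return samples

samples = rejection_sample(500, {"B": "T"})
# Conditioning on B=T should skew A toward "T" versus its 0.3 prior,
# since P(A=T | B=T) = 0.27 / (0.27 + 0.14) ≈ 0.66.
```

Note how every draw inconsistent with the evidence is thrown away: this is exactly why the rejection rate, and hence the runtime, blows up when evidence sits on nodes deep in the network.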

pascal-schetelat avatar May 07 '20 12:05 pascal-schetelat

Ok, so rejection sampling was easy.

Gibbs sampling, on the other hand, requires a little more work to build the transition kernel of the Markov chain.

Nothing impossible, but I stumbled on an issue when the data does not contain all possible combinations. This breaks distribution.joint() and marginal().

The issue seems to be adressed by https://github.com/jmschrei/pomegranate/pull/680

Do you know if/when this could be merged ?

Edit: never mind, using model.graph.state.distribution instead of model.state.distribution did the trick without having to re-estimate the marginals.

pascal-schetelat avatar May 12 '20 13:05 pascal-schetelat

What is the status of this? Are you looking into adding Gibbs sampling or do you want to move forward with this addition? If so, can you add unit tests?

jmschrei avatar May 16 '20 22:05 jmschrei

I started with rejection sampling, which is easy to implement but quite slow as soon as evidence is specified on non-root nodes, due to a high rejection rate. It has been working in my fork for two weeks, but tests still need to be implemented.

If you want to look : https://github.com/cstb/pomegranate

I also started implementing Gibbs sampling, which is a bit more involved than rejection sampling. I tried a couple of unsuccessful approaches, but I am happy to report that I am finally getting there, and I expect to submit a pull request in the next couple of weeks.
I'm currently comparing results with the rejection sampler to make sure everything is ok. I still need to work on seed management, as I don't rely on numpy for weighted choice but on a Cython function.

On a 12-node network with an average of 2-3 parents per node, the Gibbs sampler takes 150 ms to draw 1000 samples without evidence. The more evidence, the faster it goes. The algorithm's execution time is roughly linear in:

  • the number of samples
  • the number of unknown nodes
  • the number of edges in the Markov blanket of the unknown nodes

So far, on the networks I have tried, I have not observed a need to discard burn-in samples, but it is a bit early to be definitive.
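
The Markov-blanket resampling step at the heart of a Gibbs sampler can be sketched on a toy two-node network `A -> B` (again illustrative, not this fork's code): clamped evidence nodes are simply skipped, which is also why more evidence makes the sampler faster.

```python
import random

# Toy network A -> B with binary states "T"/"F".
P_A = {"T": 0.3, "F": 0.7}
P_B_given_A = {"T": {"T": 0.9, "F": 0.1}, "F": {"T": 0.2, "F": 0.8}}

def gibbs_sample(n, evidence, rng=None):
    """Resample each free node from its distribution conditioned on its
    Markov blanket; evidence nodes stay clamped throughout the chain."""
    rng = rng or random.Random(0)
    state = {"A": evidence.get("A", "T"), "B": evidence.get("B", "T")}
    samples = []
    for _ in range(n):
        if "A" not in evidence:
            # A's Markov blanket is {B}: P(A | B) ∝ P(A) * P(B | A).
            w = {a: P_A[a] * P_B_given_A[a][state["B"]] for a in ("T", "F")}
            state["A"] = "T" if rng.random() < w["T"] / (w["T"] + w["F"]) else "F"
        if "B" not in evidence:
            # B's Markov blanket is its parent A: sample from P(B | A).
            state["B"] = "T" if rng.random() < P_B_given_A[state["A"]]["T"] else "F"
        samples.append(dict(state))
    return samples

chain = gibbs_sample(1000, {"B": "T"})
frac_a_true = sum(s["A"] == "T" for s in chain) / len(chain)
# Exact posterior: P(A=T | B=T) = 0.27 / 0.41 ≈ 0.66, so frac_a_true
# should land near 0.66.
```

In this tiny example the per-node weights involve only the node's prior and its children's conditionals, which is what makes the cost scale with the edges of each unknown node's Markov blanket rather than with the whole network.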

pascal-schetelat avatar May 17 '20 13:05 pascal-schetelat

By the way, I probably should not have relied on this assumption, but is there a good reason why distribution.keys() and the keys in distribution.keymap are sometimes in the same order and sometimes not?

pascal-schetelat avatar May 17 '20 21:05 pascal-schetelat

Hi @pascal-schetelat Firstly, many thanks for this contribution.

Does the Gibbs sampler for BNs have an implementation yet?

ekosman avatar Jun 14 '20 13:06 ekosman

I just made the PR here : https://github.com/jmschrei/pomegranate/pull/778

pascal-schetelat avatar Jun 17 '20 19:06 pascal-schetelat

Great! @pascal-schetelat

Looking forward to using it :)

ekosman avatar Jun 17 '20 19:06 ekosman

Thank you for your contribution. However, pomegranate has recently been rewritten from the ground up to use PyTorch instead of Cython, and so this PR is unfortunately out of date and will be closed.

jmschrei avatar Apr 16 '23 06:04 jmschrei