Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

Chapter 2 - Alternative PyMC model - unexplained inconsistency with prior results

Open williamscott opened this issue 12 years ago • 7 comments

Near the end of Chapter 2, in the section titled "Alternative PyMC model" a different model for the Privacy Algorithm is introduced. While I think I understand the equivalence of this and the prior model, what I don't understand is the big difference in the resulting posterior distributions.

It might be helpful to mention (even briefly) why this alternative choice of model gives such a different result.

[Images: posterior distributions from the alternative model and the original model]

williamscott avatar Nov 14 '13 12:11 williamscott

A few others have noted this before, but thanks for bringing it up. I think this is a convergence issue -- the first model is naively trying to estimate N parameters, where N is the number of individuals polled. This can be quite large, and MCMC will need more iterations to converge. Perhaps I'll lower N so the results are more in line.
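
For reference, the first model looks roughly like this (a PyMC2-style sketch; variable names and iteration counts are approximate, not copied verbatim from the chapter). The point is that every polled individual gets three latent Bernoulli variables, so the sampler is exploring on the order of 3N binary unknowns plus the cheating frequency:

```python
import pymc as pm

N = 100   # individuals polled
X = 35    # observed "yes" answers

p = pm.Uniform("freq_cheating", 0, 1)

# One latent Bernoulli per person: true cheater status plus two coin flips.
true_answers = pm.Bernoulli("truths", p, size=N)
first_flips = pm.Bernoulli("first_flips", 0.5, size=N)
second_flips = pm.Bernoulli("second_flips", 0.5, size=N)

@pm.deterministic
def observed_proportion(t=true_answers, fc=first_flips, sc=second_flips):
    # Answer truthfully on a heads first flip, otherwise answer with the second flip.
    return (fc * t + (1 - fc) * sc).sum() / float(N)

# The extra Binomial likelihood layer that this thread is questioning.
observations = pm.Binomial("obs", N, observed_proportion, observed=True, value=X)

mcmc = pm.MCMC([p, true_answers, first_flips, second_flips,
                observed_proportion, observations])
mcmc.sample(40000, 15000)   # illustrative iteration / burn-in counts
```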

CamDavidsonPilon avatar Nov 17 '13 04:11 CamDavidsonPilon

Actually, just replicating your commentary above in the text would solve this for me... certainly, then the text would convey that while the two models are equivalent in a theoretical sense, one is slower to converge because of the practicalities of estimation. I didn't get this the first time around :-(

williamscott avatar Nov 17 '13 09:11 williamscott

I think the first model is actually wrong.

I'm just trying to get a better grasp on MCMC myself, so I may be wrong, but I wasn't able to follow the first example. When I regenerated the plots with more samples, the two posteriors kept disagreeing in the way I expected:

First model:

[Image: posterior distributions from the first model]

Second model:

[Image: posterior distributions from the second model]

I was following the text up to the point where it says "The researchers observe a Binomial random variable [...]". At this point I wrote this note:

I would have expected that we are done modelling at this point, and can simply tie observed_proportion.value to 35/N, marking it as observed. Why add a binomial below? Isn't what we do above (counting the number of successes of Bernoulli trials) exactly the definition of the binomial, apart from the privacy masking? Making a binomial trial with the observed_proportion seems to me like an additional step not mentioned in the experiment setup, one which preserves the expected value of the outcome but adds noise.

But I don't know enough pymc (or mcmc) yet to fix the first model.
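
For comparison, the alternative model at the end of the section looks roughly like this (again a PyMC2-style sketch, so details may be slightly off). It has no per-person simulation and no extra sampling layer: the coin flips are folded analytically into a single "skewed" probability, and the observed count is attached to one Binomial.

```python
import pymc as pm

N, X = 100, 35

p = pm.Uniform("freq_cheating", 0, 1)

@pm.deterministic
def p_skewed(p=p):
    # P(respondent says "yes") = P(heads) * p + P(tails) * P(second flip is heads)
    return 0.5 * p + 0.5 * 0.5

yes_responses = pm.Binomial("number_cheaters", N, p_skewed, observed=True, value=X)

mcmc = pm.MCMC([p, p_skewed, yes_responses])
mcmc.sample(25000, 2500)   # illustrative counts
```

Here the only unknown is p, which presumably is why this version mixes so much faster.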

martinxyz avatar Jul 17 '14 18:07 martinxyz

I'm glad you ran the MCMC with more samples; it confirmed to me that something is amiss. I believe you are correct, too...

Let me get back to you on this.

CamDavidsonPilon avatar Jul 18 '14 01:07 CamDavidsonPilon

It seems to me that this question on StackOverflow discusses a similar problem. (The problem of attaching observed values to something that isn't a distribution.)

martinxyz avatar Jul 21 '14 21:07 martinxyz

Was this ever resolved? I'm sort of confused by this model as well... It seems to me martinxyz is right both about the binomial distribution adding some extra noise and about there being no clear way to fix it?

IDK how that could be, as you'd think you could always sample from, say, a uniform distribution with bounds at x +/- epsilon... but then, I have no idea how any of this works.

Could the inconsistency be due to the fact that in one approach we are simulating coin tosses, while in the other we simply say that the probability of each coin toss is 0.5? In the simulated approach we are introducing two more distributions on top of the uniform prior on the cheating frequency.

By this logic the simulated approach is more correct, as it represents what actually happened in the trial.
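
A quick simulation (my own sketch, not code from the chapter) illustrates the extra noise being discussed: fix the cheating frequency p, then compare the spread of "yes" counts from the simulated-flips-plus-extra-Binomial pipeline against a single Binomial draw with the analytic p_skewed = 0.5*p + 0.25.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, trials = 100, 0.2, 100_000

# Layered version: simulate per-person flips, then draw a Binomial
# on the realized proportion (the extra likelihood layer).
truths = rng.random((trials, N)) < p
first = rng.random((trials, N)) < 0.5
second = rng.random((trials, N)) < 0.5
says_yes = np.where(first, truths, second)       # truth on heads, else second flip
realized_prop = says_yes.mean(axis=1)
layered_counts = rng.binomial(N, realized_prop)  # second round of sampling noise

# Analytic version: one Binomial with the marginal "yes" probability.
p_skewed = 0.5 * p + 0.25
direct_counts = rng.binomial(N, p_skewed, size=trials)

print("layered std:", layered_counts.std())
print("direct  std:", direct_counts.std())
```

The layered standard deviation should come out roughly sqrt(2) times the direct one, consistent with two comparable layers of sampling noise being stacked on top of each other.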

ksnikiforov avatar Sep 06 '23 20:09 ksnikiforov