cgpm
What should `logpdf` of an observed cell return (i.e., given a rowid which has already been incorporated)?
Consider the following annotated session with a hypothetical lightweight cgpm language:
```
# Creates a normal gpm with NIG prior, 1 output variable named 'x', and no inputs.
>> gpm <- normal-gpm(outputs=['x'], inputs=None)

# Returns density of x=2 from the prior.
>> gpm.logpdf(rowid=1, query={'x': 2})
-0.12

# Returns a sample of x_1 | {gpm, data={}}.
>> gpm.simulate(rowid=1, query=['x'])
{'x': 1.28}

# Incorporates observation (rowid=1, {'x': 1.28}).
>> gpm.incorporate(rowid=1, query={'x': 1.28})

# Returns a sample of x_1 | {gpm, data={(rowid=1, {'x': 1.28})}}.
>> gpm.simulate(rowid=1, query=['x'])
{'x': 1.28}

# The distribution of x_1 | {gpm, data={(rowid=1, {'x': 1.28})}} has no density:
# it is a dirac(1.28) measure, which is not absolutely continuous with respect
# to any measure I know of.
>> gpm.logpdf(rowid=1, query={'x': 1.28})
?? undefined density ??

# Returns a sample of x_2 | {gpm, data={(rowid=1, {'x': 1.28})}}.
>> gpm.simulate(rowid=2, query=['x'])
{'x': 0.41}

# Deletes observation (rowid=1, {'x': 1.28}).
>> gpm.unincorporate(rowid=1)

# Returns 1 sample of x_1 | {gpm, data={}}.
>> gpm.simulate(rowid=1, query=['x'])
{'x': -0.18}
```
The question is: what should `logpdf` return for an observed cell? Note that `simulate` is not a problem, since we know the distribution of the constrained random variable; but that distribution does not have a density. In general, I would be happy to always throw an error (as we do now) on this query...
BUT ... BQL has this notion of PREDICTIVE PROBABILITY (which is an ad-hoc approximation of some Bayesian quantity that is unclearly/unrigorously specified in the BayesDB paper, but turns out to be extremely useful for real-life data analysis workflows).
So, what do we do? Debate!
Here's one possible interpretation, for a single model, assuming every model has a notion of latent variables. Let a_r, b_r be the values of the observable variables in observed row r, and x_r the value of the latent variable in observed row r; let A_*, B_* be the observable variables of an unobserved row, and X_* the latent variable of an unobserved row. At row r, PREDICTIVE PROBABILITY OF B can be interpreted as Pr[B_* = b_r | A_* = a_r, X_* = x_r].
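To make that quantity concrete, here is a minimal sketch under assumed illustrative model parameters: a two-column mixture where the latent x_r is a cluster assignment and, within a cluster, the columns A and B are independent Gaussians. The cluster table, helper names, and parameter values below are all assumptions for illustration, not part of any cgpm interface.

```python
import math

# Illustrative assumption: per-cluster (mean, stddev) for columns A and B.
CLUSTERS = {
    0: {'A': (0.0, 1.0), 'B': (0.0, 1.0)},
    1: {'A': (5.0, 2.0), 'B': (3.0, 1.5)},
}

def normal_logpdf(value, mean, std):
    # Log density of a univariate Gaussian.
    return -0.5 * math.log(2 * math.pi * std**2) \
        - (value - mean)**2 / (2 * std**2)

def predictive_logp(b_r, x_r):
    # log Pr[B_* = b_r | A_* = a_r, X_* = x_r]: once the latent cluster
    # x_r is fixed, B is independent of A in this toy model, so the
    # conditioning on A_* = a_r drops out and the answer is the Gaussian
    # log density of b_r under cluster x_r's parameters for column B.
    mean, std = CLUSTERS[x_r]['B']
    return normal_logpdf(b_r, mean, std)
```

The point of the sketch is that conditioning on the row's latent x_r reduces PREDICTIVE PROBABILITY to an ordinary density evaluation inside one mixture component.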
I think this may even be close to or exactly the same quantity that bayeslite/lovecat effectively computes by shenanigans that look incoherent as written.
@riastradh-probcomp yes, your interpretation is the one that lovecat uses. The reason that I put this issue in cgpm, and not in bayeslite, is the question of how we can define the above operation in terms of the cgpm interface. It seems like an overwhelming violation of abstraction to compute predictive probability, even if we carefully specify the Bayesian quantity it is computing (as you did above).
Suppose A has index 0 and B has index 1 in your example above. One possible way to compute the predictive probability is to define a `rowid' <- cgpm.clone(rowid)` method, which synthesizes a new observation rowid' that is identical to rowid. Then we can use `cgpm.unincorporate(rowid', query=[1])`, followed by `cgpm.logpdf(rowid', query={1: b})`.
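The clone recipe above can be mocked end to end with a deliberately trivial model. The `ToyCgpm` class, its independent-Bernoulli columns, and the fixed weights are all assumptions made up for this sketch; only the shape of the clone/unincorporate/logpdf sequence mirrors the proposal.

```python
import copy
import math

class ToyCgpm:
    """Mock cgpm: columns are independent Bernoulli variables with fixed
    weights (an illustrative assumption, not a real cgpm)."""

    def __init__(self, weights):
        self.weights = weights          # weights[col] = Pr[col = 1]
        self.data = {}                  # rowid -> {col: value}

    def incorporate(self, rowid, query):
        self.data[rowid] = dict(query)

    def unincorporate(self, rowid, query=None):
        if query is None:
            del self.data[rowid]        # forget the whole row
        else:
            for col in query:           # forget only the named cells
                del self.data[rowid][col]

    def clone(self, rowid):
        # Synthesize a fresh rowid with the same observed cells.
        new = max(self.data) + 1
        self.data[new] = copy.deepcopy(self.data[rowid])
        return new

    def logpdf(self, rowid, query):
        # Columns are independent here, so remaining observed cells do
        # not shift the density of the queried cells.
        return sum(math.log(self.weights[c] if v else 1 - self.weights[c])
                   for c, v in query.items())

# The recipe from the text: clone the row, forget column 1, then query it.
g = ToyCgpm({0: 0.5, 1: 0.25})
g.incorporate(1, {0: 1, 1: 1})
rowid2 = g.clone(1)
g.unincorporate(rowid2, query=[1])
lp = g.logpdf(rowid2, {1: 1})           # log Pr[B_* = 1 | A_* = 1, ...]
```

In a model with dependence between columns (unlike this toy), the surviving cell 0 in the cloned row is exactly what makes the query conditional on A_* = a_r.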
Pah...
Well. How gross is it, really?
Suppose we have observed fixed values a_r, b_r for the random variables A_r, B_r, i.e. we have a distribution conditioned on A_r = a_r, B_r = b_r among the data D. Then if we interpret `cgpm.logpdf(A_r = 42)` to mean log Pr[A_r = 42 | D], obviously the answer should be a resounding 0. But is that useful? Would anyone ever want to ask that question, in practice?
On the other hand, we could consistently interpret `cgpm.logpdf(A_r = 42)` to mean log Pr[A_* = 42 | D, B_* = b_r, X_* = x_r] -- i.e., any time someone asks about an observed cell in an observed row, give the answer instead for the corresponding cell in a hypothetical row having all the other information from the observed row.
On the third hand, maybe there are cases where it is easy to accidentally ask a qualitatively different question -- about a hypothetical value in an unobserved cell for an otherwise observed row, versus about a hypothetical individual sharing every characteristic in common except one observed cell with an observed row -- in which case perhaps there should be a different name for asking the question.
But maybe it is sufficient to split the name `logpdf` into `logpdf_observed` and `logpdf_unobserved`, instead of any more elaborate API -- do we actually have any uses for asking simultaneously about multiple cells in observed and unobserved rows, as `logpdf` currently supports?
> On the other hand, we could consistently interpret `cgpm.logpdf(A_r = 42)` to mean log Pr[A_* = 42 | D, B_* = b_r, X_* = x_r] -- i.e., any time someone asks about an observed cell in an observed row, give the answer instead for the corresponding cell in a hypothetical row having all the other information from the observed row.
@riastradh-probcomp this formalism seems to me the most reasonable one. First of all, it is model-independent, i.e. there is no notion of telling the cgpm to "use the same latents for the hypothetical row as the observed row"; second, its implementation is closest to what is currently being done. I am going to extend your interpretation above of `cgpm.logpdf(A_r = 42)` to the case where there is an evidence clause, i.e. `cgpm.logpdf(rowid=r, query={A: 42}, evidence={B: 5})`, where the behavior will be to compute Pr[A_* = 42 | B_* = 5, X_* = x_r], the point being that b_r has been replaced with the user-specified constraint B_* = 5, while all other row variables are reused.
How the cgpm decides to deal with the latents for the hypothetical row in the query (reuse them from the observed row, resample them based on the user-specified evidence, marginalize over them, etc) is not specified by the interface. Different cgpms will have the ability to optimize the query differently.
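The constraint-merging part of that convention (everything except the latents, which the interface deliberately leaves unspecified) can be sketched as a small helper. The function name `resolve_constraints` and the data layout are assumptions for illustration, not part of the cgpm interface.

```python
def resolve_constraints(data, rowid, query, evidence):
    """Sketch of the convention above (names are assumptions): for an
    observed rowid, the hypothetical row's constraints are the
    user-supplied evidence, plus every observed cell of the row that is
    neither queried nor overridden by the evidence."""
    constraints = dict(evidence or {})
    for col, value in data.get(rowid, {}).items():
        if col not in query and col not in constraints:
            constraints[col] = value    # reuse the observed cell
    return constraints

# Observed row 7 with three cells; query A with user evidence B=5:
# B's observed value 2.0 is replaced by 5, and C is reused as-is.
data = {7: {'A': 1.0, 'B': 2.0, 'C': 3.0}}
constraints = resolve_constraints(data, 7, query={'A': 42}, evidence={'B': 5})
```

The cgpm would then evaluate the hypothetical-row density of the query subject to `constraints`, handling X_* however it chooses.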
It will be some time before I uniformly refactor all the cgpms in the library to adhere to the above convention. (Addendum: it is not straightforward to program the above logic for the venturescript cgpm.)
Further justification of Pr[A_* = a_r | B_* = b_r]: when, in Crosscat, we evaluate PREDICTIVE PROBABILITY OF A as a Monte Carlo integral of Pr[A_* = a_r | B_* = b_r, X_* = x_{r,i}, M = m_i] over all models m_i (i.e., clusterings) and latent variables x_{r,i} (i.e., the category assignment of row r in model i), I suspect we are effectively computing a Monte Carlo estimate of Pr[A_* = a_r | B_* = b_r] already. If, instead, we evaluated a Monte Carlo integral of Pr[A_* = a_r | B_* = b_r, M = m_i], i.e. averaging over all possible latent variables of row r given the model i, I think that would be another Monte Carlo estimate of the same quantity Pr[A_* = a_r | B_* = b_r].
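The Monte Carlo estimate described above is just an average of the per-model conditional densities. A minimal sketch, with made-up input values standing in for log Pr[A_* = a_r | B_* = b_r, X_* = x_{r,i}, M = m_i]:

```python
import math

def predictive_prob_mc(logp_terms):
    """Monte Carlo estimate of Pr[A_* = a_r | B_* = b_r]: average the
    exponentiated per-model conditional log densities over the posterior
    samples (m_i, x_{r,i}).  Input values are illustrative assumptions."""
    return sum(math.exp(lp) for lp in logp_terms) / len(logp_terms)

# One conditional density per sampled model m_i / latent x_{r,i}:
terms = [math.log(0.10), math.log(0.12), math.log(0.08)]
estimate = predictive_prob_mc(terms)
```

Whether the per-term conditioning includes X_* = x_{r,i} or marginalizes it within model m_i, the average is an estimator of the same marginal quantity, which is the point of the paragraph above.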