[Some questions about implementation]
Hi, I'm Junmo Cho.
I've read the paper, which I found very interesting. Sorry for taking your time, but I have some questions that came up while running the code.
- Does the minus of `binary_cross_entropy` between `img` and `pred_img` come from assuming the reward distribution is Bernoulli? I thought each pixel $y$ in `img` (the ground-truth target, with value 1 or 0) is scored under $\mathrm{Ber}(y\mid\pi)=\pi^{y}(1-\pi)^{1-y}$, where the corresponding per-pixel value of `pred_img` is used as $\pi$. Please correct me if my understanding is wrong.
- Another thing: why do we divide `logprobs` and the reward by `steps` (the length of the GFN's generation sequence, 16 here) when calculating the TB loss? I thought `logprobs` is itself the log of the product of $P_F(s_i\mid s_{i-1})$ from $i=1$ to $n$, as in the paper.
- Also, why is there no backward-policy term in the TB loss? Are we assuming a uniform backward policy and absorbing it into $\log Z$?
I would be grateful for any answers! Thanks.
Hi Junmo, thanks for your interest in the paper.
- The reward distribution is a distribution over latent vectors (the discrete code sequences), not images, and the reward for a latent $z$ is $R(z)=p(z)\,p(x\mid z)$, where $p(z)$ is the prior and $p(x\mid z)$ is the decoder's likelihood of generating $x$ from the latent $z$. The term you are asking about computes $\log p(x\mid z)$: the negative binary cross-entropy between the decoder's output (`pred_img`) and $x$ (`gt_img`) is the same as the log-likelihood of observing the target image $x$ when all the pixels are sampled from independent Bernoullis whose logits are given by the decoder's output.
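  For concreteness, here is a minimal PyTorch sketch of this correspondence. The tensor names and shapes are illustrative, not taken from the repo, and it assumes the decoder outputs per-pixel Bernoulli logits:

  ```python
  import torch
  import torch.nn.functional as F

  # Illustrative shapes only (not the repo's tensors); assumes the decoder
  # outputs per-pixel Bernoulli logits, hence *_with_logits below.
  gt_img = torch.randint(0, 2, (1, 28 * 28)).float()   # binary target x
  pred_logits = torch.randn(1, 28 * 28)                # decoder output for z

  # log p(x | z) under independent per-pixel Bernoullis
  log_px_given_z = -F.binary_cross_entropy_with_logits(
      pred_logits, gt_img, reduction="sum"
  )

  # The same quantity written out from the Bernoulli log-pmf
  # sum_i [ y_i log(pi_i) + (1 - y_i) log(1 - pi_i) ],  pi = sigmoid(logits)
  pi = torch.sigmoid(pred_logits)
  manual = (gt_img * pi.log() + (1 - gt_img) * (1 - pi).log()).sum()
  assert torch.allclose(log_px_given_z, manual, atol=1e-4)
  ```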
- This is just a trick for numerical stability. The TB loss for a trajectory $s_0\rightarrow s_1\rightarrow\dots\rightarrow s_n$ is $\left(\log Z+\log\prod_{i=1}^{n} P_F(s_i\mid s_{i-1})-\log R(s_n)\right)^2$, and we simply divided all terms inside the square by $n$ (absorbing the constant $\frac1n$ into the learned $\log Z$). The optimisation problem is equivalent with and without such division by $n$.
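  In code, the rescaling amounts to something like the following sketch (all names are placeholders; `log_Z_over_n` is the learned scalar that plays the role of $\log Z / n$):

  ```python
  import torch

  def tb_loss_rescaled(log_Z_over_n, logprobs, log_reward, n):
      # Standard TB would be (log Z + logprobs - log_reward) ** 2;
      # dividing every term inside the square by n only rescales the loss
      # by 1/n^2, so the minimizers are unchanged.
      return (log_Z_over_n + logprobs / n - log_reward / n) ** 2

  log_Z_over_n = torch.tensor(0.0, requires_grad=True)  # learned parameter
  loss = tb_loss_rescaled(log_Z_over_n, torch.tensor(-12.3),
                          torch.tensor(-20.0), n=16)
  loss.backward()
  ```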
- The generation of the latent is autoregressive (we sample the entries of the latent code one by one, in a fixed order). Therefore, each non-initial state has only one parent: the parent of a state (partial code) in which the first $k$ entries have been chosen is the state in which the first $k-1$ entries have been chosen and the $k$-th one is undetermined. This makes $P_B(s_{k-1}\mid s_k)$ automatically 1 for all transitions $s_{k-1}\rightarrow s_k$, so the backward term contributes nothing to the loss. It would be possible to use a different generation scheme where the policy can set the value of any undetermined entry in the latent code, not necessarily following a fixed order. In that case, we would indeed have a nontrivial backward policy. If the backward policy were fixed to uniform, it would be possible, as you noticed, to omit it from the loss and merge it into $\log Z$, since the product of backward transition likelihoods would be the same for all trajectories.
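  A toy sketch of this fixed-order generation, with a hypothetical `policy` callable standing in for the actual model (none of these names come from the repo):

  ```python
  import torch
  from torch.distributions import Categorical

  # Toy sketch of fixed-order autoregressive generation; `policy` is a
  # hypothetical callable, not the repo's actual interface.
  def sample_code(policy, n_steps=16):
      state, logprobs = [], torch.tensor(0.0)
      for k in range(n_steps):
          dist = Categorical(logits=policy(state))  # P_F(. | s_{k-1})
          a = dist.sample()
          logprobs = logprobs + dist.log_prob(a)
          state.append(int(a))  # fix the k-th entry; removing it is the only
          # way back, so P_B(s_{k-1} | s_k) = 1 and log P_B = 0 drops out.
      return state, logprobs

  uniform_policy = lambda state: torch.zeros(8)  # placeholder over 8 symbols
  code, lp = sample_code(uniform_policy)
  ```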