Puzzles about an inconsistency between the code and the article
```python
# losses.sliced_sm
import torch
from torch import autograd

def sliced_score_estimation(score_net, samples, n_particles=1):
    dup_samples = samples.unsqueeze(0).expand(n_particles, *samples.shape).contiguous().view(-1, *samples.shape[1:])
    dup_samples.requires_grad_(True)
    vectors = torch.randn_like(dup_samples)
    vectors = vectors / torch.norm(vectors, dim=-1, keepdim=True)

    grad1 = score_net(dup_samples)                           # h, the estimated score
    gradv = torch.sum(grad1 * vectors)                       # projections v^T h, summed over the batch
    loss1 = torch.sum(grad1 * vectors, dim=-1) ** 2 * 0.5    # second term of J(\theta): 0.5 * (v^T h)^2
    grad2 = autograd.grad(gradv, dup_samples, create_graph=True)[0]  # grad of v^T h w.r.t. the samples (z)
    loss2 = torch.sum(vectors * grad2, dim=-1)               # first term of J(\theta): v^T (\nabla h) v

    loss1 = loss1.view(n_particles, -1).mean(dim=0)
    loss2 = loss2.view(n_particles, -1).mean(dim=0)
    loss = loss1 + loss2
    return loss.mean(), loss1.mean(), loss2.mean()
```

```python
# losses.vae.elbo_ssm
z = imp_encoder(X)
ssm_loss, *_ = sliced_score_estimation_vr(functools.partial(score, dup_X), z, n_particles=n_particles)
```
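For anyone who wants to poke at this function directly, here is a minimal toy usage sketch. The `ToyScore` network and the 2-D data are made up for illustration (only standard PyTorch is assumed); it simply calls the `sliced_score_estimation` shown above:

```python
import torch
import torch.nn as nn

# Hypothetical toy score network: maps a batch of 2-D points to a 2-D score estimate.
class ToyScore(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 2))

    def forward(self, x):
        return self.net(x)

score_net = ToyScore()
samples = torch.randn(16, 2)          # stand-in for the latent samples z
loss, loss1, loss2 = sliced_score_estimation(score_net, samples, n_particles=4)
loss.backward()                       # gradients flow into score_net's parameters
print(loss.item(), loss1.item(), loss2.item())
```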
To my understanding, `grad1` is the estimation of the score $h = S_{m}(x;\theta)$ and `loss2` is the first term of $J(\theta)$, which is $v^{T}\nabla_{x}h(x;\theta)v$. But in the code, it seems to be calculated as $v^{T}\nabla_{z}h(x;\theta)v$.
Yes @chen-qj, I noticed this too. Did you figure out why?
I have another question. The multiplication of vectors and grad1/grad2 is element-wise, but in the paper it is a matrix multiplication. Or did I misunderstand the theory?
> To my understanding, `grad1` is the estimation of the score $h = S_{m}(x;\theta)$ and `loss2` is the first term of $J(\theta)$, which is $v^{T}\nabla_{x}h(x;\theta)v$. But in the code, it seems to be calculated as $v^{T}\nabla_{z}h(x;\theta)v$.
The author is not using score matching to learn the data distribution of $x$; instead, he uses score matching to estimate the gradient of the entropy of the implicit (encoder) distribution over $z$. So the code computes the gradient with respect to $z$ instead of $x$.
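To make that concrete, here is a rough sketch, not the repo's actual code: the encoder and score model below are hypothetical stand-ins, the score model is not conditioned on `X` (unlike the repo's `functools.partial(score, dup_X)`), and `z` is detached just to keep the toy example focused on the score model. It shows that SSM operates entirely in $z$-space:

```python
import torch
import torch.nn as nn

imp_encoder = nn.Sequential(nn.Linear(784, 32))   # hypothetical implicit encoder producing z
score_model = nn.Sequential(nn.Linear(32, 32))    # hypothetical score network over z

X = torch.randn(8, 784)         # data batch; no derivative is ever taken w.r.t. X
z = imp_encoder(X)              # latent samples; these are the "samples" passed to SSM

# Inside sliced_score_estimation, autograd.grad(gradv, dup_samples) differentiates
# the projected score w.r.t. (copies of) z, i.e. it computes v^T \nabla_z h v.
ssm_loss, *_ = sliced_score_estimation(score_model, z.detach(), n_particles=1)
ssm_loss.backward()             # in this simplified sketch, only the score model gets gradients
```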
> I have another question. The multiplication of vectors and grad1/grad2 is element-wise, but in the paper it is a matrix multiplication. Or did I misunderstand the theory?
They are equivalent. If you flatten the data into one dimension, it becomes easier to see: the element-wise product followed by `torch.sum(..., dim=-1)` is exactly the dot product $v^{T}s$ for each sample in the batch.
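A quick numerical check of that equivalence (plain PyTorch, nothing from the repo):

```python
import torch

s = torch.randn(5, 3)    # a batch of score vectors h
v = torch.randn(5, 3)    # a batch of projection directions

elementwise = torch.sum(v * s, dim=-1)    # what the repo's code computes
dot = torch.einsum('bi,bi->b', v, s)      # explicit per-sample dot product v^T s
assert torch.allclose(elementwise, dot)
```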