
Disagreement with paper

Open EDAPINENUT opened this issue 3 years ago • 6 comments

I suspect that the code is still based on score-matching methods. In models/epsnet/dualenc.py, line 478, the noisy sample in the forward diffusion process differs from eq. (4) in DDPM and from the equation with subscript 2 in the paper, so the noise computed from d_gt and d_perturbed is hard to understand. Second, the encoder's forward method does not take time_step as a parameter when computing the noise. The sampling process in the langevin_dynamics_sample_diffusion method is also different from Algorithm 1 in the paper: why is step_size formulated as shown in line 443? Could you give me more details on the implementation? Perhaps the code is based on an improved version of the original DDPM, such as a score-based one? Where can I find more material to understand the code where it differs from the paper?
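For reference, the standard DDPM forward process the question compares against, eq. (4) of Ho et al. 2020, q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I), can be sketched as follows (a minimal NumPy illustration, not the repo's code):

```python
import numpy as np

# Standard DDPM forward (noising) process, eq. (4) of Ho et al. 2020:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
# This is the reference formula the repo code at dualenc.py:478 is being
# compared against in the question above.
def ddpm_forward(x0, alpha_bar_t, rng):
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 3))              # e.g. 5 atoms, 3D coordinates
xt, eps = ddpm_forward(x0, alpha_bar_t=0.25, rng=rng)
```

Note that this transition keeps the marginal variance bounded (variance-preserving), which is the property at issue in the rest of the thread.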

EDAPINENUT avatar Jun 19 '22 14:06 EDAPINENUT

Hi Haitao,

Sorry for the late reply; I just traveled to SF last week and spent a lot of time getting over jet lag and settling in. I'm replying briefly here to confirm that I received your questions, and I can find a later time to discuss the details if anything is still unclear :)

Briefly, d_gt and d_perturbed are mainly the variables involved in the alignment approach described in the paper. As for your point that the code is still based on score-matching methods, I actually agree! DDPMs are essentially the same as DSMs, and many papers use the two terms interchangeably. Secondly, for embedding the timestep, we draw inspiration from DSMs and implement it implicitly: 1) use the timestep to look up the noise scale, and 2) use this noise scale to reweight the network output. As for the sampling algorithm langevin_dynamics_sample_diffusion: yes, it is different from Alg. 1 in the paper. It is a Langevin dynamics process, widely used in DSMs, which gives better results in practice.
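The implicit time conditioning described above can be sketched roughly like this (hypothetical names, not the GeoDiff API; the pattern follows NCSN-style noise-conditional scaling, where the score of the perturbed distribution scales roughly as 1/sigma):

```python
import numpy as np

# DSM-style implicit timestep conditioning (a sketch, not the repo's code):
# the timestep t never enters the network; it only selects a noise scale
# sigma_t, and the raw network output is rescaled by 1/sigma_t.
def score_estimate(raw_net_output, t, sigmas):
    sigma_t = sigmas[t]               # step 1: timestep -> noise scale
    return raw_net_output / sigma_t   # step 2: reweight the network output

sigmas = np.geomspace(1.0, 0.01, num=100)   # decreasing noise schedule
raw = np.ones((5, 3))                       # stand-in for a network output
s = score_estimate(raw, t=0, sigmas=sigmas)
```

So a single time-unconditional network can serve all noise levels, with the level-dependence carried entirely by the rescaling.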

Thanks a lot for your interest in our work!!

Best Minkai


Brief reference: DSMs: http://yang-song.github.io/blog/2021/score/ DDPMs: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/

MinkaiXu avatar Jun 27 '22 01:06 MinkaiXu

I think DDPM is a variance-preserving process and DSM is a variance-exploding one. If your method is not actually variance-preserving, "diffusion" may not be a suitable description. The noising scheme for distance diffusion in your paper (line 9 in the chain-rule approach) is different from your code. I changed the code to match your paper and found that the loss does not converge.

From my perspective, this is somewhat serious. The chain-rule approach cannot be implemented with the diffusion (variance-preserving) process in your framework, but it works well in the variance-exploding process. However, one of your main contributions is to employ a diffusion model for the conformation generation task. If it is still a score-matching model, the difference between GeoDiff and ConfGF may not be very significant.

EDAPINENUT avatar Jun 28 '22 04:06 EDAPINENUT

Hi Haitao,

Yes, you are careful with all the details; the perturbation process indeed does not look variance-preserving. Actually, I have thought about this a lot before, so maybe we can have some discussion here :)

First, recall that you previously also mentioned that the timestep should be embedded. In my implementation, I view the model as: 1) use the timestep to get the corresponding alpha, and 2) use the alpha to rescale the data to achieve variance preservation. So the perturbed data in my code is technically "the diffused data (as shown in the paper) divided by \sqrt{alpha-bar}". The same naturally holds for the distances: the noisy distances are also divided by \sqrt{alpha-bar}. To recover the theoretical value of the perturbed distances d_t, they should be multiplied by \sqrt{alpha-bar}.

I think your main concern is this line (https://github.com/MinkaiXu/GeoDiff/blob/main/models/epsnet/dualenc.py#L297), i.e., why I also rescale d_perturbed; the answer is that the data itself is also rescaled. Independent of this paper, for general DDPMs: after I realized this rescaling with the timestep, I found that DSMs and DDPMs are indeed very similar. We can even look at DDPM sampling: when x_t and x_{t-1} are both rescaled by \sqrt{alpha-bar_t} and \sqrt{alpha-bar_{t-1}}, the 1/\sqrt{alpha} factor also drops out, and the sampling process likewise stays at the original data scale. Happy to discuss if I happen to have made a mistake here!
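The rescaling argument can be checked numerically: dividing the variance-preserving sample x_t by \sqrt{alpha-bar} gives exactly x_0 plus noise of standard deviation \sqrt{(1 - alpha-bar) / alpha-bar}, i.e. a variance-exploding (DSM-style) perturbation at the original data scale. A minimal sketch (my own illustration, not the repo code):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 3))     # clean coordinates at the data scale
alpha_bar = 0.25
eps = rng.standard_normal(x0.shape)

# Variance-preserving form (as written in the paper):
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Rescaled form (as stored in the code): divide by sqrt(alpha_bar).
x_tilde = x_t / np.sqrt(alpha_bar)

# The rescaled sample is x0 + sigma * eps with a VE-style noise level:
sigma = np.sqrt((1.0 - alpha_bar) / alpha_bar)
```

So the two parameterizations describe the same sample; only the scale at which it is stored differs, which is why d_perturbed must be rescaled consistently with the coordinates.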

Secondly, why do I do this kind of "complicated" rescaling? Notice the problem with variance preservation: it changes the scale of the coordinates from the original data scale down to standard-Gaussian size, where the numbers are much smaller (very likely < 1). This makes the GNNs fail, because the radius predefined for constructing the spatial graph at the original scale (to make the graph fully connected) and the predefined RBF functions do not work well with significantly shorter distances. That is why I rescale before feeding the data into the GNNs.

Then, for the question of why your implemented version doesn't work, there are some plausible reasons now. Personally: 1) the data is actually still rescaled, so modifying only the chain-rule part makes the code inconsistent; 2) even if you further change the data-rescaling part, this falls back to my very first attempts: one would need to set all the hyperparameters (radius, RBFs, and so on) in a "timestep-adaptive" way, which would be really hard work.

Besides, given the rescaled DDPM sampling I discussed above, I also took this into account when writing the sampling code, and there is a DDPM sampler that follows this logic. I would admit that, although this DDPM sampler achieves competitive results, the margins are not as significant as with Langevin dynamics. That is why I released LD as the default sampling algorithm.
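For context, annealed Langevin dynamics in the style of Song & Ermon (2019), where the step size at each noise level is typically eps * sigma_i^2 / sigma_L^2, can be sketched as follows; this is a generic illustration with a toy score function, not the repo's sampler, though it may explain step-size formulas like the one asked about at dualenc.py:443:

```python
import numpy as np

# Generic annealed Langevin dynamics (a sketch, not GeoDiff's code).
# At each noise level sigma, take several Langevin steps with step size
# proportional to sigma^2, annealing from large to small noise.
def annealed_langevin(score_fn, x, sigmas, eps=2e-5, steps=10, rng=None):
    rng = rng or np.random.default_rng(0)
    for sigma in sigmas:                          # large -> small noise
        step = eps * (sigma / sigmas[-1]) ** 2    # step size ~ sigma^2
        for _ in range(steps):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * step * score_fn(x, sigma) + np.sqrt(step) * z
    return x

# Toy score of an isotropic Gaussian N(0, (sigma^2 + 1) I):
# pulls samples toward the origin, more weakly at higher noise.
score = lambda x, sigma: -x / (sigma ** 2 + 1.0)
x = annealed_langevin(score, np.ones((4, 3)) * 5.0,
                      sigmas=np.geomspace(1.0, 0.01, 10))
```

The sigma^2 scaling keeps the signal-to-noise ratio of each update roughly constant across noise levels, which is the usual motivation for this step-size schedule.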

Finally, since you express doubt about the contribution, I would also like to say a little about this. In my mind, the main contribution has always been acting directly on coordinates, which solves the training-inference inconsistency problem. In ConfGF, for example, the noisy distances during training can violate the triangle inequality, which is not the case during sampling. This is highlighted throughout the intro and related work. DDPMs are a nice probabilistic framework to motivate how the coordinates should be processed, and DSMs are also nice. I do in fact have two codebases, one for each, but I ultimately wrote the paper with diffusion models since the DSM paper already exists.

Sorry that this rather "complicated" implementation consideration is not explained in the paper; personally, I think this part is so detailed and specific that it is not suitable for a general methodology paper. I hope this resolves some of your questions :)

Best Minkai

MinkaiXu avatar Jun 28 '22 06:06 MinkaiXu

Hi Minkai. I also agree with you that the time embedding is not necessary for calculating the noise.

Thanks for your detailed explanation. I also think the rescaling term (sqrt(\alpha)) destroys the 3D coordinate information; DDPM frames this as the input signal being corrupted gradually, but the diffusion process may not be very suitable for 3D coordinate signals.

Maybe in the next version this discussion could be appended to the paper, so that other readers do not suspect that the implementation disagrees with the diffusion process.

Recently, another paper, "EDM: E(3) Equivariant Diffusion Model for Molecule Generation in 3D", seems to fix this problem from the scaling point of view. I will check their code further and find out what improvements they made to make the variance-preserving framework work.

Thanks a lot for your explanation from the perspective of rescaling the coordinates. Once I find a general solution that rescales the coordinates while preserving the variance, I will send you an e-mail about it for further discussion.

EDAPINENUT avatar Jun 28 '22 07:06 EDAPINENUT

At first, I guessed that the second, global encoder was what made the code invalid, due to the rescaling problem with the cutoff hyperparameter. I tried a single-encoder version with the global encoder removed, but the variance-preserving version still does not work well.

I think more effort should be put into this problem, and if there is any technical progress on getting this to work well, I hope we can discuss it further :)

EDAPINENUT avatar Jun 28 '22 07:06 EDAPINENUT

Yes, indeed. I also found that it's the GIN encoder that makes the main difference.

I also enjoyed reading EDM and have actually looked through their code, though I'm not sure whether adopting their method directly for conditional conformation generation can beat GeoDiff (and other related works). The molecules generated by recent molecule generation models still don't seem very realistic, so I suspect their benchmark might be weaker than conformation generation, which would let a simpler formulation achieve "competitive" results.

But yes, their method is insightful and also shows some mathematical differences. Glad to discuss if you have thoughts on it; maybe we can find some exciting solutions to further improve geometric diffusion!

MinkaiXu avatar Jun 28 '22 17:06 MinkaiXu