DiffuSeq

sampling issue

Open helder-ribeiro opened this issue 2 years ago • 4 comments

Hi all! First, I would like to thank you for sharing the code! I'd like to apply DiffuSeq to some seq2seq tasks involving protein sequences (amino-acid tokens). As a first test, I trained the model on a toy task that reverses the order of a sequence, e.g. I E T M L (source seq) to L M T E I (target seq), using a training set of 8K samples and a validation set of 2K samples. During training, the training and validation losses decreased to ~0 as expected (metrics at the last training step below):

| metric | value |
| --- | --- |
| grad_norm | 0.0696 |
| loss | 0.0783 |
| loss_q0 | 0.0783 |
| loss_q1 | 0.0783 |
| loss_q2 | 0.0784 |
| loss_q3 | 0.0783 |
| mse | 0.0715 |
| mse_q0 | 0.0714 |
| mse_q1 | 0.0715 |
| mse_q2 | 0.0715 |
| mse_q3 | 0.0714 |
| nll | 2.76 |
| nll_q0 | 2.76 |
| nll_q1 | 2.76 |
| nll_q2 | 2.76 |
| nll_q3 | 2.76 |
| samples | 3.84e+06 |
| step | 3e+04 |

However, when I generate sequences with the decoding bash script, this is what I obtain for the last trained checkpoint: {"recover": "", "reference": "[CLS] L M T E I [SEP]", "source": "[CLS] I E T M L [SEP] [SEP]"}. It looks like the model only predicts PAD tokens given the source sequence. If I decode from an intermediate training checkpoint instead, I obtain: {"recover": "[SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]", "reference": "[CLS] L M T E I [SEP]", "source": "[CLS] I E T M L [SEP] [SEP]"}
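For reference, a minimal sketch of how this collapse can be quantified over the whole generated .jsonl output (the file path and the special-token set below are assumptions, not something defined by DiffuSeq itself):

```python
import json

# Hypothetical path to the sampler's output file; adjust to your own run.
OUT_FILE = "generation_outputs/samples_step30000.jsonl"

# Assumption: BERT-style special tokens that carry no content.
SPECIAL = {"[PAD]", "[CLS]", "[SEP]", "[UNK]"}

total = degenerate = 0
with open(OUT_FILE) as f:
    for line in f:
        rec = json.loads(line)
        tokens = rec["recover"].split()
        total += 1
        # Count recoveries that contain no ordinary tokens at all
        # (an empty "recover" string also lands here).
        if all(tok in SPECIAL for tok in tokens):
            degenerate += 1

print(f"{degenerate}/{total} recovered sequences are empty or special-tokens-only")
```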

This is my first time working on diffusion models applied to sequences, so I don't know if that could be a problem related to hyperparameters. Would you have any thoughts on that?

Some information that may be relevant to this case:

  • I used the ProtBert tokenizer from Rostlab/prot_bert on Hugging Face, but without initializing from the pretrained model weights (use_plm_init no);
  • Example of my training samples (a small generation sketch follows this list): {"src":"N N E T P","trg":"P T E N N"} {"src":"Q E W Q R","trg":"R Q W E Q"} {"src":"G Q M P M","trg":"M P M Q G"}
  • Training parameters: --diff_steps 2000
    --lr 0.0001
    --learning_steps 30000
    --save_interval 10000
    --seed 102
    --noise_schedule sqrt
    --bsz 128
    --dataset reverse2
    --vocab bert
    --schedule_sampler lossaware
    --notes test-reverse2
    --data_dir /home/ribeiroh/Projetos/DiffuSeq/datasets/reverse
    --seq_len 18
    --config_name Rostlab/prot_bert
    --hidden_dim 128
    --use_plm_init no
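
For completeness, a small sketch of how a toy reversal dataset in the {"src": ..., "trg": ...} format shown above could be generated (the 20-letter amino-acid alphabet, the fixed length of 5, and the output file names are assumptions; match the file names to whatever your --data_dir loader expects):

```python
import json
import random

# Assumption: the 20 standard amino-acid one-letter codes as the token alphabet.
ALPHABET = list("ACDEFGHIKLMNPQRSTVWY")

def make_split(path, n_samples, seq_len=5, seed=0):
    """Write n_samples {"src", "trg"} pairs where trg is src reversed."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n_samples):
            tokens = [rng.choice(ALPHABET) for _ in range(seq_len)]
            src = " ".join(tokens)
            trg = " ".join(reversed(tokens))
            f.write(json.dumps({"src": src, "trg": trg}) + "\n")

# Hypothetical file names; the data loader may expect specific names in --data_dir.
make_split("train.jsonl", 8000, seed=1)
make_split("valid.jsonl", 2000, seed=2)
```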

Thank you so much!

helder-ribeiro avatar Jan 11 '23 15:01 helder-ribeiro

Hi, I think the model is not well trained, so it cannot recover meaningful tokens. Maybe you could try other hyperparameters. Another concern is that your dataset is a bit small.

summmeer avatar Jan 13 '23 03:01 summmeer

Thank you for the thoughts/suggestions. For this kind of model, is there any metric I could monitor during training to check whether the model is well trained? I ask because the training and validation losses looked fine in this case, yet the model still does not seem to be well trained.

helder-ribeiro avatar Jan 13 '23 13:01 helder-ribeiro

Actually, it's not easy, because the training and inference stages are not strictly symmetrical. You can try to recover from 50%-noised data instead of from pure Gaussian noise.
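
In outline (plain PyTorch, not code from this repo; the linear-beta schedule and the embedding tensor below are placeholders), "recovering 50%-noised data" means noising the reference target embedding with the standard forward process q(x_t | x_0) = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps at t ≈ T/2, then running the reverse chain from that t instead of from pure noise at t = T:

```python
import torch

def partially_noise(x0, alphas_cumprod, t):
    """Standard forward step q(x_t | x_0) = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

T = 2000                                    # --diff_steps used for training
# Placeholder linear-beta schedule; in practice reuse the alphas_cumprod of the
# trained model's (sqrt) noise schedule so the noising matches training.
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0_emb = torch.randn(1, 18, 128)            # placeholder for the target-sequence embedding
t_start = T // 2                            # start from ~50% noise instead of t = T
x_t = partially_noise(x0_emb, alphas_cumprod, t_start)
# ...then run the usual reverse/denoising loop from t_start down to 0,
# keeping the source-side tokens anchored as during training.
```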

summmeer avatar Jan 15 '23 05:01 summmeer

Hi, has this issue been resolved? I've encountered a similar problem where the model only generates a single token (my sequences are also proteins). I noticed that during sampling the intermediate values of x_t become abnormally large: at intermediate timesteps the range of x_t can reach around 9-10, while, strangely, the final x_0 values do fall within the expected (-1, 1) range. This is clearly abnormal.
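
For anyone wanting to reproduce this observation, a tiny helper of this kind can be dropped into the sampling loop to print the value range of x_t at each step (the loop and the per-step update below are placeholders, not DiffuSeq functions):

```python
import torch

def log_xt_range(x_t: torch.Tensor, t: int) -> None:
    """Print the min/max of the latent at timestep t during sampling."""
    print(f"t={t:4d}  min={x_t.min().item():+.3f}  max={x_t.max().item():+.3f}")

# Inside the reverse loop (placeholder names), call it after every denoising step:
# for t in reversed(range(T)):
#     x_t = denoise_step(model, x_t, t)   # hypothetical per-step update
#     log_xt_range(x_t, t)
```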

jkurpost avatar Mar 05 '25 07:03 jkurpost