DiffuSeq
sampling issue
Hi all! First, I would like to thank you for sharing the code! I'd like to apply DiffuSeq to some seq2seq tasks involving protein sequences (amino-acid tokens). As a first test, I trained the model on a toy task: reversing the order of a sequence, e.g. I E T M L (source seq) to L M T E I (target seq). I used a training set of 8K samples and a validation set of 2K samples. During training, both the training and validation losses decreased toward ~0 as expected (metrics at the last learning step below):
| metric | value |
| --- | --- |
| grad_norm | 0.0696 |
| loss | 0.0783 |
| loss_q0 | 0.0783 |
| loss_q1 | 0.0783 |
| loss_q2 | 0.0784 |
| loss_q3 | 0.0783 |
| mse | 0.0715 |
| mse_q0 | 0.0714 |
| mse_q1 | 0.0715 |
| mse_q2 | 0.0715 |
| mse_q3 | 0.0714 |
| nll | 2.76 |
| nll_q0 | 2.76 |
| nll_q1 | 2.76 |
| nll_q2 | 2.76 |
| nll_q3 | 2.76 |
| samples | 3.84e+06 |
| step | 3e+04 |
However, when I generate sequences using the decoding bash script, I get this for the last training checkpoint: {"recover": "", "reference": "[CLS] L M T E I [SEP]", "source": "[CLS] I E T M L [SEP] [SEP]"} It looks like the model predicted only [PAD] tokens from the source sequence. Decoding from intermediate training checkpoints gives this instead: {"recover": "[SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP] [SEP]", "reference": "[CLS] L M T E I [SEP]", "source": "[CLS] I E T M L [SEP] [SEP]"}
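To quantify this over the whole decoded file rather than a few samples, a quick check like the sketch below can count exact matches between the recovered and reference sequences (the output path is just an illustrative example; adjust it to wherever your decoding script writes its jsonl):

```python
# Count how many generated sequences exactly match the reference
# after stripping special tokens. Field names follow the decoded jsonl above.
import json

SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

def clean(text):
    """Return the sequence tokens with special tokens removed."""
    return [tok for tok in text.split() if tok not in SPECIAL]

matches, total = 0, 0
with open("generation_outputs/decoded.jsonl") as f:   # illustrative path
    for line in f:
        rec = json.loads(line)
        matches += clean(rec["recover"]) == clean(rec["reference"])
        total += 1

print(f"exact match: {matches}/{total}")
```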
This is my first time working with diffusion models applied to sequences, so I don't know whether this could be a hyperparameter issue. Do you have any thoughts on that?
Here is some information that may be relevant to this case:
- I used the ProtBert tokenizer from Rostlab/prot_bert on Hugging Face, but without initializing from the pretrained model (use_plm_init no);
- Example of my training samples (a sketch of how data in this format can be generated is shown after this list): {"src":"N N E T P","trg":"P T E N N"} {"src":"Q E W Q R","trg":"R Q W E Q"} {"src":"G Q M P M","trg":"M P M Q G"}
- Training parameters:
--diff_steps 2000
--lr 0.0001
--learning_steps 30000
--save_interval 10000
--seed 102
--noise_schedule sqrt
--bsz 128
--dataset reverse2
--vocab bert
--schedule_sampler lossaware
--notes test-reverse2
--data_dir /home/ribeiroh/Projetos/DiffuSeq/datasets/reverse
--seq_len 18
--config_name Rostlab/prot_bert
--hidden_dim 128
--use_plm_init no
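For context, data in the jsonl format above can be produced with a small script along these lines (file names, paths, and the split sizes are illustrative, not the exact code I used):

```python
# Generate the toy reversal dataset: each line is a JSON object with
# space-separated amino-acid tokens in "src" and the reversed sequence in "trg".
import json
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def write_split(path, n_samples, seq_len=5, seed=0):
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n_samples):
            src = [rng.choice(AMINO_ACIDS) for _ in range(seq_len)]
            sample = {"src": " ".join(src), "trg": " ".join(reversed(src))}
            f.write(json.dumps(sample) + "\n")

write_split("datasets/reverse/train.jsonl", 8000, seed=102)  # 8K training samples
write_split("datasets/reverse/valid.jsonl", 2000, seed=103)  # 2K validation samples
```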
Thank you so much!
Hi, I think the model is not well trained, so it cannot recover meaningful tokens. Maybe you could try other hyperparameters. Another concern is that your dataset is a bit small.
Thank you for the thoughts/suggestions. For this kind of model, is there any metric I can monitor during training to check whether the model is well trained? I ask because the training and validation losses looked fine in this case, yet the model still seems poorly trained.
Actually, it's not easy, because the training and inference stages are not strictly symmetrical. As a sanity check, you can try to recover data that is only 50% noised instead of starting from pure Gaussian noise.
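A minimal sketch of that idea, using generic DDPM-style placeholders rather than DiffuSeq's actual API (`x0_emb` is the embedded ground-truth target, `alphas_cumprod` the cumulative noise schedule, and `p_sample` one reverse-diffusion step): build x_t at t = T/2 with the closed-form forward process and run the reverse loop from there. If training went well, the recovered output should stay close to the ground truth.

```python
# Start the reverse process from a 50%-noised version of the ground-truth
# embedding instead of pure Gaussian noise (placeholder names, not DiffuSeq API).
import torch

def sample_from_half_noised(x0_emb, alphas_cumprod, p_sample, diff_steps=2000):
    t_start = diff_steps // 2                    # begin at t = T/2 instead of t = T
    abar = alphas_cumprod[t_start]               # cumulative alpha at t_start
    noise = torch.randn_like(x0_emb)
    # Closed-form forward process: x_t = sqrt(abar) * x_0 + sqrt(1 - abar) * eps
    x_t = abar.sqrt() * x0_emb + (1.0 - abar).sqrt() * noise
    # Run the usual reverse loop, but only from t_start down to 0
    for t in reversed(range(t_start)):
        x_t = p_sample(x_t, t)
    return x_t                                   # should be close to x0_emb if training worked
```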
Hi, has this issue been resolved? I've encountered a similar problem where the model only generates a single token (my sequences are also proteins). I noticed that during sampling, the intermediate values of x_t become abnormally large: at intermediate timesteps the range of x_t can reach around 9-10 (strangely, the final x_0 values do fall within the expected (-1, 1) range). This is clearly abnormal.
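In case it helps anyone debugging the same behaviour, a rough way to locate where the blow-up happens is to log the value range of x_t at each reverse step. The sketch below uses placeholder names (`x_T`, `p_sample`) for your own sampling loop, so treat it as an illustration rather than DiffuSeq's code:

```python
# Trace the range of x_t during the reverse loop to find where magnitudes blow up.
import torch

def traced_sampling(x_T, p_sample, diff_steps=2000, log_every=200):
    x_t = x_T
    for t in reversed(range(diff_steps)):
        x_t = p_sample(x_t, t)
        if t % log_every == 0:
            # Healthy intermediate states should stay roughly within (-1, 1)
            print(f"t={t:5d}  min={x_t.min().item():+.3f}  "
                  f"max={x_t.max().item():+.3f}  std={x_t.std().item():.3f}")
    return x_t
```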