Repetitive sequence generation with ESM3 under conditioning
I am trying to use ESM3 for sequence design by specifying the coordinates and amino acid composition of key motifs.
However, I found that the default temperature of 0.7 produces highly repetitive sequences for certain proteins.
For example, "MAAAAAAAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEAAAAAAAAAAAAAEAAAAAAAAAAAAAAAAAKADAADAEAIAAAAAAGATPVAKAANEALATYTAKAGVIFAQDQGKNAQALPAIQAAHAAFASARYIAAYARGAAAYALAGVLDAAAAAGIAIAAAAAAAAAAAKTAAAGLAAAAAAAAAATAAAAAKAAVAAAAAAAAAATASANAAAMAAAAAAPEDTATAAGIALLPVPGDLAAAAAAAAAAAAAAAAAAAAAAVAAAAAAAAVAAAAAAAAAAVAAAAGAKAAAAAAAAAALAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADQGAEWLSRLDRGANAAAAAAAAAAGAAAAAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAAAAAAGAAATALPRGGAKGALLAAEGGASVLATGGGRFAPRIRDLADVAAPANGLKDAGAYEAAGGALKGAAAGAVAAAGAAAAAVAAAAATAGATGFLATANGLAAIGSDLAAVTVAVAAGINAAANSAGAQALNKGEAINAAFSAAGAAAAAQAAATADNAAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAVAAAAAAA".
The original residue information comes from chain A of 7qd4. I prompted with a small DNA-binding region.
When I raised the temperature to 1, the repetition decreased, but the pLDDT and pTM scores dropped significantly.
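To make "repetitive" measurable when comparing temperatures, here is a minimal sketch of two simple statistics: the fraction of the most common residue and the longest run of a single residue. The function name `repetition_stats` and the short toy sequence are my own illustration and are not part of ESM3.

```python
from collections import Counter
from itertools import groupby


def repetition_stats(seq: str) -> dict:
    """Summarize how repetitive a protein sequence is (illustrative helper)."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    # Most frequent residue and how often it occurs
    residue, count = Counter(seq).most_common(1)[0]
    # Longest stretch of one identical residue (homopolymer run)
    longest_run = max(len(list(group)) for _, group in groupby(seq))
    return {
        "length": len(seq),
        "top_residue": residue,
        "top_fraction": count / len(seq),
        "longest_run": longest_run,
    }


# Toy input only; substitute the generated sequence here.
toy = "M" + "A" * 10 + "S" + "A" * 4
stats = repetition_stats(toy)
print(stats)
```

Computing these numbers for the outputs at T=0.7 and T=1 would put the repetition on the same footing as the pLDDT and pTM comparison.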
This may be a common problem with language models. It would be great if you could provide a more principled decoding strategy.
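As one possible direction, here is a rough sketch of a count-based repetition penalty applied to the logits before temperature sampling. This is a generic technique and not ESM3's decoding code or API. The function name, the `penalty` parameter, the `log1p` form of the penalty, and the toy logits are all assumptions for illustration, and I have not tested how this would affect pLDDT or pTM.

```python
import math
import random


def sample_with_repetition_penalty(logits, counts, temperature=0.7,
                                   penalty=0.5, rng=random):
    """Sample one residue after down-weighting residues that are already frequent.

    logits: dict residue -> raw score (hypothetical per-position model output)
    counts: dict residue -> number of times it already appears in the sequence
    """
    # Subtract penalty * log(1 + count), then divide by the temperature
    adjusted = {
        aa: (score - penalty * math.log1p(counts.get(aa, 0))) / temperature
        for aa, score in logits.items()
    }
    # Numerically stable softmax
    top = max(adjusted.values())
    weights = {aa: math.exp(v - top) for aa, v in adjusted.items()}
    total = sum(weights.values())
    probs = {aa: w / total for aa, w in weights.items()}
    residues = list(probs)
    choice = rng.choices(residues, weights=[probs[r] for r in residues], k=1)[0]
    return choice, probs


# Toy logits for three residues; alanine already appears 50 times.
toy_logits = {"A": 2.0, "K": 1.0, "E": 1.0}
rng = random.Random(0)
_, plain = sample_with_repetition_penalty(toy_logits, {}, rng=rng)
choice, penalized = sample_with_repetition_penalty(toy_logits, {"A": 50}, rng=rng)
print(plain["A"], penalized["A"], choice)
```

With `penalty=0` this reduces to plain temperature sampling, so the penalty strength can be tuned separately from the temperature.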
Thanks for the great study!
Do you have code for this? We don't see this much repetition at T=0.7, so this is a bit surprising. What's the exact conditioning prompt?