ReLSO-Guided-Generative-Protein-Design-using-Regularized-Transformers

Could internal structure in the Gifford data lead to biased results from the model?

Open chengyunzhang opened this issue 3 years ago • 0 comments

I'm very interested in your work and have been analyzing the Gifford data. First, I used CD-HIT (a clustering tool) to split the sequences into clusters. Then I chose the sequence with the highest enrichment value in Cluster-1 (a subset of similar sequences produced by CD-HIT) as a baseline and looked at the residue differences between it and the other sequences. Interestingly, I found that sequences differing from the baseline by 2 or 3 residues usually have high enrichment: among the top 100 sequences by enrichment, they can reach 65%. As far as I know, your work is a multitask approach that covers both generation and prediction. **I wonder whether the JT-VAE tends to produce new sequences that differ by about 2 or 3 residues from the corresponding highest-enrichment baseline sequence, and whether the prediction network then scores such sequences as good results.** That would mean the model only needs to learn that a new sequence differing from a high-enrichment sequence by 2 or 3 residues is good enough. Because I could not find results addressing this, I hope you can give me some advice.
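For reference, the check described above could be sketched roughly as follows. This is only an illustration under several assumptions: the cluster is already available as `(sequence, enrichment)` pairs, all sequences in it have equal length so that a position-wise (Hamming) comparison is meaningful, and the baseline itself is left out of the top-N set. The function names and the toy sequences are hypothetical and are not taken from the Gifford data or the ReLSO code.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def mutation_profile(records, top_n=100):
    """Summarize residue differences to the highest-enrichment sequence.

    records: list of (sequence, enrichment) pairs from one cluster.
    Returns (baseline sequence, Counter of difference counts among the
    top_n other sequences, fraction of those with 2 or 3 differences).
    """
    # Baseline: the sequence with the highest enrichment in the cluster.
    baseline_seq, _ = max(records, key=lambda r: r[1])
    # Assumption: the baseline is excluded from the top-N set.
    others = [r for r in records if r[0] != baseline_seq]
    top = sorted(others, key=lambda r: r[1], reverse=True)[:top_n]
    counts = Counter(hamming(baseline_seq, seq) for seq, _ in top)
    frac_2_or_3 = (counts[2] + counts[3]) / len(top) if top else 0.0
    return baseline_seq, counts, frac_2_or_3


# Made-up toy cluster, for illustration only.
toy_cluster = [
    ("ACDEFGHIKL", 3.0),  # baseline (highest enrichment)
    ("ACDEFGHIKV", 2.5),  # 1 difference
    ("ACDEFGHAAL", 2.8),  # 2 differences
    ("ACDAAAHIKL", 2.7),  # 3 differences
    ("AAAAAGHIKL", 0.4),  # 4 differences
]

baseline, diff_counts, fraction = mutation_profile(toy_cluster, top_n=3)
print(baseline, dict(diff_counts), fraction)
```

Running the same summary on sequences sampled from the model, rather than on the training data, would be one way to see whether generated sequences concentrate at 2 or 3 differences from the baseline.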

chengyunzhang, Oct 18 '22 09:10