R2DT
Add method to determine the quality of a diagram
R2DT can produce diagrams that are either not visually pleasing or have some issue with the secondary structure. Visually, this can mean the diagram has a lot of overlaps (https://rnacentral.org/rna/URS000000AD71/1828414), or has helices that look like they should be placed closer to each other than they are (https://rnacentral.org/rna/URS00009F9E0C/358574), or other issues. On the pairing side, this often means an underfolded structure (https://rnacentral.org/rna/URS0000004EA0/2261). Sometimes this could be because the diagram covers only a small fraction of the whole model (though other times it is great that R2DT works with such cases) or only a small part of the whole sequence (though again not always). I think work on this issue will have to span both the display and folding parts.
For this issue I would like to have R2DT produce fewer diagrams that have visual issues and indicate to users when the secondary structure itself has some issue. I think this is something @nawrockie and @davidhoksza could help with.
Current status
I have collected examples of these issues at: https://www.google.com/url?q=https://docs.google.com/spreadsheets/d/1FwVJ3p457e5-r4GPeOkZO7avH3pe_RAo9ji8zvHulmw/edit?usp%3Dsharing&sa=D&source=editors&ust=1624436833207000&usg=AOvVaw0LOkgNqMM_tYW-JMHkPcd4. The spreadsheet shows the diagrams that have issues and those that do not. The `predicted_should_show` column is the result of a model, while the `Labeled Should show` column is what I thought the result should be. `predicted_should_show` is green if the model says the diagram should be shown and red if not. `Labeled Should show` is green when it agrees with the `predicted_should_show` column.
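As a rough illustration of how the two columns could be compared, here is a minimal sketch that counts agreeing rows. It assumes the spreadsheet has been exported to CSV and that both columns hold true/false-like strings; the `agreement` helper, the `urs` column, and the sample rows are made up for the example and are not taken from the real spreadsheet.

```python
import csv
import io


def agreement(rows, predicted_col="predicted_should_show",
              labeled_col="Labeled Should show"):
    """Return (number of agreeing rows, total rows) for the two columns."""

    def as_bool(value):
        # Assumption: cells contain strings such as "True"/"False".
        return value.strip().lower() in {"true", "yes", "1"}

    agree = 0
    total = 0
    for row in rows:
        total += 1
        if as_bool(row[predicted_col]) == as_bool(row[labeled_col]):
            agree += 1
    return agree, total


# Made-up sample rows standing in for a CSV export of the spreadsheet.
SAMPLE = """urs,predicted_should_show,Labeled Should show
example_1,True,True
example_2,False,True
example_3,False,False
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
agree, total = agreement(rows)
print(f"{agree}/{total} rows agree")
```

Rows where the two columns disagree are the ones worth inspecting by hand, since they are either mislabeled or cases the model gets wrong.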
I also developed a random forest model for RNAcentral that can determine which rRNA structures should be shown on our pages and which should not. The model takes into account several factors:
- The model source (CRW, Rfam, etc.)
- The sequence length
- The length of the resulting diagram
- The length of the model
- The number of basepairs in the model
- The number of basepairs in the diagram
- The length of the model matched by the diagram
- The number of overlaps in the diagram
The model does well enough on rRNA diagrams but not as well on Rfam or tRNA diagrams. This may be because I didn't train it on enough examples, or because I didn't include some important feature for these sources. It may also not do well with large insertions into otherwise nice diagrams.
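To make the feature list above concrete, here is a minimal sketch of how one diagram could be turned into a numeric vector for a classifier such as a random forest. This is not the code used for RNAcentral: the `DiagramFeatures` class, its field names, the `KNOWN_SOURCES` tuple, and the example values are all hypothetical, and the one-hot encoding of the source is just one possible choice.

```python
from dataclasses import dataclass

# Hypothetical source list; the issue names CRW and Rfam and then says "etc".
KNOWN_SOURCES = ("CRW", "Rfam")


@dataclass(frozen=True)
class DiagramFeatures:
    source: str                # model source, e.g. "CRW" or "Rfam"
    sequence_length: int       # length of the input sequence
    diagram_length: int        # length of the resulting diagram
    model_length: int          # length of the model
    model_basepairs: int       # number of basepairs in the model
    diagram_basepairs: int     # number of basepairs in the diagram
    model_matched_length: int  # length of the model matched by the diagram
    overlap_count: int         # number of overlaps in the diagram


def to_vector(features):
    """One-hot encode the source, then append the numeric features."""
    one_hot = [1.0 if features.source == s else 0.0 for s in KNOWN_SOURCES]
    numeric = [
        features.sequence_length,
        features.diagram_length,
        features.model_length,
        features.model_basepairs,
        features.diagram_basepairs,
        features.model_matched_length,
        features.overlap_count,
    ]
    return one_hot + [float(value) for value in numeric]


# Made-up values, for illustration only.
example = DiagramFeatures(
    source="Rfam",
    sequence_length=120,
    diagram_length=118,
    model_length=130,
    model_basepairs=40,
    diagram_basepairs=35,
    model_matched_length=110,
    overlap_count=2,
)
vector = to_vector(example)
print(vector)
```

A list of such vectors, paired with the hand-assigned labels, is the kind of input a random forest implementation would be trained on. Ratios such as diagram basepairs over model basepairs might be worth trying as extra features for the underfolded cases, but that is a guess rather than something tested here.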