R2DT Add method to determine the quality of a diagram

Open blakesweeney opened this issue 3 years ago • 0 comments

R2DT can produce diagrams that are either not visually pleasing or have some issue with the secondary structure. Visually, this can mean the diagram has a lot of overlaps (https://rnacentral.org/rna/URS000000AD71/1828414), has helices that look like they should be placed closer to each other than they are (https://rnacentral.org/rna/URS00009F9E0C/358574), or has other issues. On the pairing side this often means an underfolded structure (https://rnacentral.org/rna/URS0000004EA0/2261). Sometimes this could be because the sequence matches only a small fraction of the whole model (though other times it is great that R2DT works with such cases) or because the match covers only a small part of the whole sequence (though again not always). I think work on this issue will have to span both the display and folding parts.

For this issue I would like to have R2DT produce fewer diagrams that have visual issues and indicate to users when the secondary structure itself has some issue. I think this is something @nawrockie and @davidhoksza could help with.

Current status

I have collected examples of these issues at: https://www.google.com/url?q=https://docs.google.com/spreadsheets/d/1FwVJ3p457e5-r4GPeOkZO7avH3pe_RAo9ji8zvHulmw/edit?usp%3Dsharing&sa=D&source=editors&ust=1624436833207000&usg=AOvVaw0LOkgNqMM_tYW-JMHkPcd4. The spreadsheet shows diagrams that have issues and diagrams that do not. The predicted_should_show column is the result of a model, while the Labeled Should show column is what I thought the answer should be. Labeled Should show is green when it agrees with the predicted_should_show column, while predicted_should_show is green if the diagram should be shown and red if it should not.

I also developed a random forest model for RNAcentral that can determine which rRNA structures should be shown on our pages and which should not. The model takes into account several factors:

The model source (CRW, Rfam, etc.)
The sequence length
The length of the resulting diagram
The length of the model
The number of basepairs in the model
The number of basepairs in the diagram
The length of the model matched by the diagram
The number of overlaps in the diagram

It does well enough on rRNA diagrams but not as well on Rfam or tRNA. It may be because I didn't train it on enough examples, or because I didn't include some important feature for these sources. The model may not do well with large insertions into otherwise nice diagrams.
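To make the feature list above concrete, here is a minimal sketch of how the eight factors might be packed into a numeric vector and how agreement between labeled and predicted values could be tallied per source. This is not the code used for RNAcentral: the field names, the `SOURCES` categories, and the `accuracy_by_source` helper are all assumptions made for illustration, and the classifier itself is left out (the vectors could be passed to any random forest implementation).

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical source categories; the issue names CRW and Rfam and mentions tRNA.
SOURCES = ("CRW", "Rfam", "tRNA")


@dataclass
class DiagramFeatures:
    """The eight factors listed above (field names are illustrative)."""
    model_source: str
    sequence_length: int
    diagram_length: int
    model_length: int
    model_basepairs: int
    diagram_basepairs: int
    model_matched_length: int
    overlap_count: int

    def to_vector(self):
        """One-hot encode the source, then append the numeric features."""
        one_hot = [1.0 if self.model_source == s else 0.0 for s in SOURCES]
        numeric = [
            self.sequence_length,
            self.diagram_length,
            self.model_length,
            self.model_basepairs,
            self.diagram_basepairs,
            self.model_matched_length,
            self.overlap_count,
        ]
        return one_hot + [float(x) for x in numeric]


def accuracy_by_source(rows):
    """Fraction of rows where the label and the prediction agree, per source.

    rows: iterable of (model_source, labeled_should_show, predicted_should_show).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for source, labeled, predicted in rows:
        totals[source] += 1
        if labeled == predicted:
            correct[source] += 1
    return {source: correct[source] / totals[source] for source in totals}
```

Breaking the agreement down by source is one way to see the gap described above, where the model does better on rRNA than on Rfam or tRNA diagrams.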

Jun 23 '21 08:06 blakesweeney
