About buildSummary.pl

Open EugeneKim76 opened this issue 4 years ago • 7 comments

Dear all,

I am investigating repeats in a plant genome. I summarized the RepeatMasker results using buildSummary.pl, as shown below:

buildSummary.pl myresult.out

However, the result differed slightly from the one produced by RepeatMasker itself (the *.tbl file). For example:

total interspersed 1132417 1148692629 70.85% -> from buildSummary.pl
Total interspersed repeats: 1153699947 bp 71.16 % -> from *.tbl file

How can I understand the difference between them? Which one is better?
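To put the gap in perspective, the arithmetic on the two quoted totals can be sketched as follows. Only the four figures come from the post; the variable names are illustrative. If the genome length implied by each report is essentially the same, the two tools agree on the denominator and differ only in which bases they count as interspersed.

```python
# Figures quoted above: bp masked and percent of genome for "total interspersed".
summary_bp, summary_pct = 1_148_692_629, 70.85  # buildSummary.pl
tbl_bp, tbl_pct = 1_153_699_947, 71.16          # *.tbl file

# Absolute gap between the two reports.
diff_bp = tbl_bp - summary_bp

# Genome length implied by each report (bp masked / fraction masked).
summary_genome = summary_bp / (summary_pct / 100)
tbl_genome = tbl_bp / (tbl_pct / 100)

print(f"difference: {diff_bp:,} bp ({tbl_pct - summary_pct:.2f} percentage points)")
print(f"implied genome length: {summary_genome:,.0f} vs {tbl_genome:,.0f} bp")
```

The percentages are rounded to two decimals, so the implied lengths can only be compared approximately.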

Best,

EugeneKim76, Jul 16 '20 09:07

It does seem odd that these two values differ. It will take a careful look at the code to be certain, but I have two ideas about why this might have happened:

  • Overlapping annotations, for example where two fragments of different interspersed repeats run into each other. These base pairs might be counted toward one, or the other, or both classes of repeats depending on which report or table you are looking at. This is a difficult problem, and it's easy to make mistakes or simply disagree about the right way to count them.
    • For the same reason, you cannot always add individual counts together to reach totals.
  • Mismatches between scripts on what is counted under "interspersed repeats". For example, simple repeats and low-complexity regions are not counted as interspersed repeats. It may be that one script disagrees with the other about one of these categories.
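The first idea can be illustrated with a small sketch. It is not taken from either script: the coordinates are hypothetical, and the two functions simply show how the same annotations give different totals depending on whether overlapping bases are counted once or once per annotation.

```python
def summed_length(intervals):
    """Add up every annotation's length; bases in an overlap are counted twice."""
    return sum(end - start + 1 for start, end in intervals)


def merged_length(intervals):
    """Count each base once by merging overlapping intervals before summing."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # No overlap with the current run: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current run if this interval reaches further.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


# Hypothetical 1-based, inclusive coordinates; the first two share bases 90-100.
annotations = [(1, 100), (90, 200), (300, 350)]

print(summed_length(annotations))  # overlap counted twice
print(merged_length(annotations))  # overlap counted once
```

Here the two totals differ by exactly the 11 shared bases, which is the kind of small discrepancy that adds up across a whole genome.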

I checked some previous results and the values do match between those two tools, so any more details you can give about your data may help track this down - e.g. if you are using a custom -lib or RepBase RepeatMasker Edition, and the .tbl and .out files if they can be posted publicly or shared with us via email.

jebrosen, Jul 17 '20 04:07

Thanks for the quick reply. Since my data is not published, I will show the result from buildSummary.pl using public data (http://www.repeatmasker.org/species/euaEut.html). The result was produced using the command line shown below:

./buildSummary.pl euaEut74.fa.out > test.tbl

However, there were slight differences between my result and the result on http://www.repeatmasker.org/species/euaEut.html. For example:

L2 408719 112228109 6.14% -> my result

[image]

Did I miss a step? Which script is used to produce the final .tbl file during the RepeatMasker run?

Results from buildSummary.pl are shown below:

[image] [image]

EugeneKim76, Jul 17 '20 06:07

My apologies for the delayed response. Unfortunately this comparison to those published results is not very informative, because of the time difference. I would actually be surprised if you got the same results, because bugs have been fixed and classifications have been updated since 2014.

> Which script is used to produce the final .tbl file during the RepeatMasker run?

ProcessRepeats generates the .tbl file, and it can be re-run individually:

ProcessRepeats [-options] output.cat

Don't forget to add the -species or -lib parameters if you used them with RepeatMasker originally.

One thing that might help narrow the problem down is checking whether a particular classification of interspersed repeats is counted differently between the two results.
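A rough way to run that check is sketched below. It assumes the standard whitespace-separated RepeatMasker .out layout (query begin and end in the 6th and 7th columns, class/family in the 11th). The sample rows and the function names are made up for illustration, and the naive per-row sum ignores overlaps and fragment joining, so it will not reproduce either tool's totals exactly.

```python
from collections import defaultdict


def bp_per_class(out_lines):
    """Sum annotated query bases per class/family from .out rows (naive, no overlap handling)."""
    totals = defaultdict(int)
    for line in out_lines:
        fields = line.split()
        # Data rows start with a numeric score; header and blank rows do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        totals[fields[10]] += end - begin + 1
    return dict(totals)


def class_differences(a, b):
    """Return {class: a_bp - b_bp} for every class whose totals differ."""
    classes = sorted(set(a) | set(b))
    return {c: a.get(c, 0) - b.get(c, 0) for c in classes if a.get(c, 0) != b.get(c, 0)}


# Hypothetical .out content, for illustration only.
sample_out = """\
   SW   perc perc perc  query     position in query    matching  repeat       position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin end (left) ID

 1200   10.5  1.2  0.8  chr1        101   400  (9600) +  L2a       LINE/L2        1   300  (50)   1
  800   15.0  2.0  1.0  chr1        501   700  (9300) C  MIR       SINE/MIR     (10)  200    1    2
  950   12.0  0.0  0.5  chr1       1001  1250  (8750) +  L2b       LINE/L2       20   270  (80)   3
""".splitlines()

from_out = bp_per_class(sample_out)
# Hypothetical per-class totals copied by hand from the other report.
from_other_report = {"LINE/L2": 540, "SINE/MIR": 200}
print(class_differences(from_out, from_other_report))
```

Classes that show up in the difference are the ones worth inspecting first.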

jebrosen, Aug 07 '20 16:08

Hello @jebrosen ,

I conducted de novo TE analyses for some plant genomes using RepeatModeler2 and then passed the custom libraries to RepeatMasker 4.1.1 using -lib. However, I am not sure about the classification (superfamily/family) of the elements. For instance, the default .tbl output file lists Gypsy/DIRS1. However, DIRS1 is a group of LTR retrotransposons that is rarely found in plants; I was expecting to see Ty3/Gypsy. Although I have seen Gypsy/DIRS1 in plant genome publications, would it be appropriate to publish this type of pairing?

I also used the buildSummary.pl script to obtain the full list of annotated TEs but noticed the table contained repeats not found in plants (Crypton-V, Dada, Sola-2, penelope, ERVK, bhikhari etc). What recommendations would you have for handling this?

Thanks

Alexdami17, Jan 16 '21 02:01

Hi @Alexdami17,

> For instance, the default .tbl output file lists Gypsy/DIRS1. However, DIRS1 is a group of LTR retrotransposons that is rarely found in plants; I was expecting to see Ty3/Gypsy.

Yes, fixing this is on our wishlist but we have sadly not tackled it yet. The classifications and labels in the .tbl file are from a hand-edited list from years ago, which needs to be updated to match new knowledge and to apply to a broader range of species. The buildSummary.pl output is more directly based on the classifications in the search library, and does not make any assumptions about which groups of elements are expected to be present.

> I also used the buildSummary.pl script to obtain the full list of annotated TEs but noticed the table contained repeats not found in plants (Crypton-V, Dada, Sola-2, penelope, ERVK, bhikhari etc). What recommendations would you have for handling this?

These could be misclassifications by RepeatClassifier. Classifications are informed in part by sequence similarity to known TE proteins, and there could be false positive matches or a lack of good matches in the protein library. Were these widespread, or only for a few elements?

jebrosen, Jan 21 '21 23:01

Thanks for your response @jebrosen. It is only for a few elements, sometimes present in only one or two of the plant genomes.

In my first analysis with RepeatModeler2.0 and RepeatMasker4.0.7 on one computer, the element Bhikhari was found in one of the plant genomes (164 count, 93231 bp masked, 0.02% masked). When I ran the analysis again using RepeatModeler2.0 and RepeatMasker4.1.1 (with buildSummary.pl) on another computer, Bhikhari was found once again in the same genome (170 count, 93763 bp masked, 0.02% masked). The two analyses seem consistent; however, the LTR element Bhikhari is found in zebrafish and rarely in plants. The same goes for Crypton-V, Sola-2, Kolobok-Hydra and Dada. I think they are all false positives since they are rarely found in plants. Do you have suggestions on how to address these false positives or misclassifications by RepeatClassifier?

Also, I am puzzled that one DNA element appears under two labels, MULE-MuDR and MuLE-MuDR. This was consistent across all genomes (analyses carried out with RepeatModeler2.0 and RepeatMasker4.1.1).
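One quick way to spot labels like these, which differ only in letter case, is sketched below. The helper name is made up, and the list reuses labels mentioned in this thread purely as an example. It detects such pairs but does not explain where the two spellings come from.

```python
from collections import defaultdict


def case_variants(labels):
    """Group labels that are identical except for letter case."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.lower()].add(label)
    # Keep only groups that contain more than one distinct spelling.
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


# Example labels taken from the discussion above.
labels = ["MULE-MuDR", "MuLE-MuDR", "Gypsy", "Crypton-V", "Dada"]
print(case_variants(labels))
```

Rows grouped this way could then be summed together when building a per-family table.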

Alexdami17, Jan 24 '21 06:01

> Do you have suggestions on how to address these false positives or mis-classifications by RepeatClassifier?

@Alexdami17 In general, RepeatClassifier's output is "best effort" and may need manual correction in some cases. The same is true of the elements themselves, which may be over- or under-extended depending on the type of repeat, assembly quality, and genome characteristics. One way to correct the classifications would be to compare the obviously misclassified sequences against existing plant TE or TE protein libraries to find any better matches.
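As a starting point for that comparison, the suspect consensus sequences first have to be pulled out of the library. The sketch below assumes FASTA headers of the form `>name#Class/Family`, as RepeatModeler typically writes them; the sample records, class labels, and function names are hypothetical. The search against plant TE libraries itself is left to whatever alignment tool you prefer.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line and header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def classification(header):
    """Return the text after '#' in the first word of the header (assumed layout)."""
    words = header.split()
    name = words[0] if words else ""
    return name.split("#", 1)[1] if "#" in name else ""


def suspect_records(lines, suspect_terms):
    """Select records whose classification mentions any of the suspect terms."""
    terms = [t.lower() for t in suspect_terms]
    return [
        (header, seq)
        for header, seq in parse_fasta(lines)
        if any(t in classification(header).lower() for t in terms)
    ]


# Hypothetical library content, for illustration only.
library = """\
>rnd-1_family-3#LTR/Gypsy
ACGTACGT
>rnd-1_family-7#DNA/Crypton-V
GGGTTTAA
CCC
>rnd-2_family-1#DNA/Dada
TTTT
""".splitlines()

selected = suspect_records(library, ["Crypton-V", "Dada", "Sola-2"])
for header, seq in selected:
    print(f">{header}\n{seq}")
```

The printed records can be saved to a file and used as queries against a plant TE or TE protein library.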

If you can share your output files or a public assembly of the genome in question, we may be able to use that to help improve RepeatClassifier results if there is a problem in the software. It is also likely that we simply need to keep incorporating more plant TE proteins into our libraries for those classifications to be better in the first place.

jebrosen, Jan 25 '21 17:01